Abstract

The problem of consistently estimating the sparsity pattern of a vector $\beta^* \in \mathbb{R}^p$ based on observations contaminated by noise arises in various contexts, including subset selection in regression, structure estimation in graphical models, sparse approximation, and signal denoising. We analyze the behavior of $\ell_1$-constrained quadratic programming (QP), also referred to as the Lasso, for recovering the sparsity pattern. Our main result is to establish a sharp relation between the problem dimension $p$, the number $s$ of non-zero elements in $\beta^*$, and the number of observations $n$ that are required for reliable recovery. For a broad class of Gaussian ensembles satisfying mutual incoherence conditions, we establish existence and compute explicit values of thresholds $\theta_\ell$ and $\theta_u$ with the following properties: for any $\nu > 0$, if $n > 2(\theta_u + \nu)\, s \log(p - s) + s + 1$, then the Lasso succeeds in recovering the sparsity pattern with probability converging to one for large problems, whereas if $n < 2(\theta_\ell - \nu)\, s \log(p - s) + s + 1$, then the probability of successful recovery converges to zero. For the special case of the uniform Gaussian ensemble, we show that $\theta_\ell = \theta_u = 1$, so that the threshold is sharp and exactly determined.

Keywords: Quadratic programming; Lasso; subset selection; consistency; thresholds; sparse approximation; signal denoising; sparsity recovery; $\ell_1$-regularization; model selection.

1 Introduction 

The problem of recovering the sparsity pattern of an unknown vector $\beta^*$ (that is, the positions of the non-zero entries of $\beta^*$) based on noisy observations arises in a broad variety of contexts, including subset selection in regression [22], structure estimation in graphical models, sparse approximation, and signal denoising [6]. A natural optimization-theoretic formulation of this problem is via $\ell_0$-minimization, where the $\ell_0$ "norm" of a vector corresponds to the number of non-zero elements. Unfortunately, $\ell_0$-minimization problems are known to be NP-hard in general, so that the existence of polynomial-time algorithms is highly unlikely. This challenge motivates the use of computationally tractable approximations or relaxations to $\ell_0$-minimization. In particular, a great deal of research over the past decade has studied the use of the $\ell_1$-norm as a computationally tractable surrogate to the $\ell_0$-norm.
In more concrete terms, suppose that we wish to estimate an unknown but fixed vector $\beta^* \in \mathbb{R}^p$ on the basis of a set of $n$ observations of the form

$$Y_k = x_k^T \beta^* + W_k, \qquad k = 1, \ldots, n, \tag{1}$$

where $x_k \in \mathbb{R}^p$, and $W_k \sim N(0, \sigma^2)$ is additive Gaussian noise. In many settings, it is natural to assume that the vector $\beta^*$ is sparse, in that its support

$$S := \{\, i \in \{1, \ldots, p\} \mid \beta_i^* \neq 0 \,\} \tag{2}$$

has relatively small cardinality $s = |S|$. Given the observation model (1) and sparsity assumption (2), a reasonable approach to estimating $\beta^*$ is by solving the $\ell_1$-constrained quadratic program (QP)

$$\min_{\beta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \sum_{k=1}^n \big(Y_k - x_k^T \beta\big)^2 + \lambda_n \|\beta\|_1 \Big\}, \tag{3}$$

where $\lambda_n > 0$ is a regularization parameter. Of interest are conditions on the ambient dimension $p$, the sparsity index $s$, and the number of observations $n$ for which it is possible (or impossible) to recover the support set $S$ of $\beta^*$.
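To make the estimator concrete, the following is a minimal sketch of solving an $\ell_1$-regularized least-squares problem of the form $\frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_n\|\beta\|_1$ and checking whether the signed support is recovered. It is an illustration only: cyclic coordinate descent is a standard solver swapped in here and is not a method described in the paper, the $\frac{1}{2n}$ scaling of the quadratic term is an assumption chosen to match the $\frac{1}{n}$ factors in Lemma 1, and all function names and the demo parameters are hypothetical.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/(2n))*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    X is a list of n rows (each a list of p floats); y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta (beta starts at zero)
    col_sq = [sum(X[k][j] ** 2 for k in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # (1/n) * <column j, residual with coordinate j removed>
            rho = sum(X[k][j] * resid[k] for k in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for k in range(n):
                    resid[k] -= X[k][j] * delta
                beta[j] = new
    return beta

def sign(v):
    return (v > 0) - (v < 0)

def recovers_signed_support(beta_hat, beta_star, tol=1e-8):
    """True when sgn(beta_hat) == sgn(beta_star) elementwise (tiny entries count as zero)."""
    return all(sign(b if abs(b) > tol else 0.0) == sign(t)
               for b, t in zip(beta_hat, beta_star))

if __name__ == "__main__":
    # Hypothetical demo: uniform Gaussian design, s = 3 non-zero coefficients.
    rng = random.Random(0)
    n, p, sigma, lam = 100, 20, 0.5, 0.2
    beta_star = [1.5, -2.0, 1.0] + [0.0] * (p - 3)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, beta_star)) + rng.gauss(0.0, sigma) for row in X]
    print(recovers_signed_support(lasso_cd(X, y, lam), beta_star))
```

With an orthogonal design scaled so that $\frac{1}{n}X^TX = I$, the coordinate update reduces to a single soft-thresholding step per coordinate, which gives a simple way to check the solver by hand.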



1.1 Overview of previous work 

Given the substantial literature on the use of $\ell_1$ constraints for sparsity recovery and subset selection, we provide only a very brief (and hence necessarily incomplete) overview here. In the noiseless version ($\sigma^2 = 0$) of the linear observation model (1), one can imagine estimating $\beta^*$ by solving the problem

$$\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{subject to} \quad x_k^T \beta = Y_k, \qquad k = 1, \ldots, n. \tag{4}$$

This problem is in fact a linear program (in disguise), and corresponds to a method in signal processing known as basis pursuit, pioneered by Chen et al. For the noiseless setting, the interesting regime is the underdetermined setting (i.e., $n < p$). With contributions from a broad range of researchers, there is now a fairly complete understanding of conditions on deterministic vectors $\{x_k\}$ and sparsity index $s$ for which the true solution $\beta^*$ can be recovered exactly. Without going into technical details, the rough idea is that the mutual incoherence of the vectors $\{x_k\}$ must be large relative to the sparsity index $s$, and indeed we impose similar conditions to derive our results (e.g., conditions (14a) and (18) in the sequel). Most closely related to the current paper, as we discuss in more detail in the sequel, are recent results by Donoho, as well as Candes and Tao, that provide high probability results for random ensembles. More specifically, as independently established by both sets of authors using different methods, for uniform Gaussian ensembles (i.e., $x_k \sim N(0, I_p)$) with the ambient dimension $p$ scaling linearly in terms of the number of observations (i.e., $p = \gamma n$, for some $\gamma > 1$), there exists a constant $\alpha > 0$ such that all sparsity patterns with $s \le \alpha p$ can be recovered with high probability.

There is also a substantial body of work focusing on the noisy setting ($\sigma^2 > 0$), and the use of quadratic programming techniques for sparsity recovery. The $\ell_1$-constrained quadratic program (3), also known as the Lasso, has been the focus of considerable research in recent years. Knight and Fu analyze the asymptotic behavior of the optimal solution, not only for $\ell_1$-regularization but for $\ell_p$-regularization with $p \in (0, 2]$. Fuchs [17, 18] investigates optimality conditions for the constrained QP (3), and provides deterministic conditions, of the mutual incoherence form, under which a sparse solution, which is known to be within $\epsilon$ of the observed values, can be recovered exactly. Among a variety of other results, both Tropp and Donoho et al. also provide sufficient conditions for the support of the optimal solution to the constrained QP (3) to be contained within the true support of $\beta^*$. Most directly related to the current paper is recent work by Meinshausen and Buhlmann [28], focusing on Gaussian noise, and extensions by Zhao and Yu [35] to more general noise distributions, on the use of the Lasso for model selection. For the case of Gaussian noise, both papers established that under mutual incoherence conditions and appropriate choices of the regularization parameter $\lambda_n$, the Lasso can recover the sparsity pattern with probability converging to one for particular regimes of $n$, $p$ and $s$, when the vectors $x_k$ are drawn randomly from random Gaussian ensembles. We discuss connections to our results at more length in the sequel.



1.2 Our contributions 



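Recall the linear observation model (1). For compactness in notation, let us use $X$ to denote the $n \times p$ matrix formed with the vectors $x_k = (x_{k1}, x_{k2}, \ldots, x_{kp}) \in \mathbb{R}^p$ as rows, and the vectors $X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T \in \mathbb{R}^n$ as columns, as follows:

$$X := \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} = \begin{bmatrix} X_1 & X_2 & \cdots & X_p \end{bmatrix}. \tag{5}$$

Consider the (random) set $\mathcal{S}(X, \beta^*, W, \lambda_n)$ of optimal solutions to this constrained quadratic program (3). By convexity and boundedness of the cost function, the solution set is always non-empty. For any vector $\beta \in \mathbb{R}^p$, we define the sign function

$$\mathrm{sgn}(\beta_i) := \begin{cases} +1 & \text{if } \beta_i > 0, \\ -1 & \text{if } \beta_i < 0, \\ 0 & \text{if } \beta_i = 0. \end{cases} \tag{6}$$

Of interest is the event that the Lasso (3) succeeds in recovering the sparsity pattern of the unknown $\beta^*$:

Property $\mathcal{R}(X, \beta^*, W, \lambda_n)$: There exists an optimal solution $\hat{\beta} \in \mathcal{S}(X, \beta^*, W, \lambda_n)$ with the property $\mathrm{sgn}(\hat{\beta}) = \mathrm{sgn}(\beta^*)$.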

Our main result is that for a broad class of random Gaussian ensembles based on covariance matrices satisfying mutual incoherence conditions, there exist fixed constants $0 < \theta_\ell \le 1$ and $1 \le \theta_u < +\infty$ such that for all $\nu > 0$, property $\mathcal{R}(X, \beta^*, W, \lambda_n)$ holds with high probability (over the choice of noise vector $W$ and random matrix $X$) whenever

$$n > 2(\theta_u + \nu)\, s \log(p - s) + s + 1, \tag{7}$$

and conversely, fails to hold with high probability whenever

$$n < 2(\theta_\ell - \nu)\, s \log(p - s) + s + 1. \tag{8}$$

Moreover, for the special case of the uniform Gaussian ensemble (i.e., $x_k \sim N(0, I_p)$), we show that $\theta_\ell = \theta_u = 1$, so that the threshold is sharp. This threshold result has a number of connections to previous work in the area that focuses on special forms of scaling. More specifically, as we discuss in more detail in Section 3.2, in the special case of linear scaling (i.e., $n = \gamma p$ for some $\gamma > 0$), this theorem provides a noisy analog of results previously established for basis pursuit in the noiseless case. Moreover, our result can also be adapted to an entirely different scaling regime for $n$, $p$ and $s$, as considered by a separate body of recent work on the high-dimensional Lasso.

The remainder of this paper is organized as follows. We begin in Section 2 with some necessary and sufficient conditions, based on standard optimality conditions for convex programs, for property $\mathcal{R}(X, \beta^*, W, \lambda_n)$ to hold. We then prove a consistency result for the case of deterministic design matrices $X$. Section 3 is devoted to the statement and proof of our main result on the asymptotic behavior of the Lasso for random Gaussian ensembles. We illustrate this result via simulation in Section 4, and conclude with a discussion in Section 5.



2 Some preliminary analysis 

In this section, we provide necessary and sufficient conditions for property $\mathcal{R}(X, \beta^*, W, \lambda_n)$ to hold. Based on these conditions, we then define collections of random variables that play a central role in our analysis. In particular, the study of $\mathcal{R}(X, \beta^*, W, \lambda_n)$ is reduced to the study of the extreme order statistics of these random variables. We then state and prove a result about the behavior of the Lasso for the case of a deterministic design matrix $X$.

2.1 Necessary and sufficient conditions 

We begin with a simple set of necessary and sufficient conditions for property $\mathcal{R}(X, \beta^*, W, \lambda_n)$ to hold. We note that this result is not essentially new (variants appear in previous work), and follows in a straightforward manner from optimality conditions for convex programs; see Appendix A for further details. We define $S := \{ i \in \{1, \ldots, p\} \mid \beta_i^* \neq 0 \}$ to be the support of $\beta^*$, and let $S^c$ be its complement. For any subset $T \subseteq \{1, 2, \ldots, p\}$, let $X_T$ be the $n \times |T|$ matrix with the vectors $\{X_i, i \in T\}$ as columns.

Lemma 1. Assume that the matrix $X_S^T X_S$ is invertible. Then, for any given $\lambda_n > 0$ and noise vector $w \in \mathbb{R}^n$, property $\mathcal{R}(X, \beta^*, w, \lambda_n)$ holds if and only if

$$\Big| X_{S^c}^T X_S \big(X_S^T X_S\big)^{-1} \Big[ \frac{1}{n} X_S^T w - \lambda_n\, \mathrm{sgn}(\beta_S^*) \Big] - \frac{1}{n} X_{S^c}^T w \Big| \le \lambda_n, \quad \text{and} \tag{9a}$$

$$\mathrm{sgn}(\beta_i^*) \Big[ \beta_S^* + \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} \Big[ \frac{1}{n} X_S^T w - \lambda_n\, \mathrm{sgn}(\beta_S^*) \Big] \Big]_i > 0 \quad \text{for all } i \in S, \tag{9b}$$

where both of these vector inequalities should be taken elementwise.



For shorthand, define $b := \mathrm{sgn}(\beta_S^*)$, and denote by $e_i \in \mathbb{R}^s$ the vector with 1 in the $i$-th position, and zeroes elsewhere. Motivated by Lemma 1, much of our analysis is based on the collections of random variables, defined for each index $i \in S$ and $j \in S^c$ as follows:

$$U_i := e_i^T \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} \Big[ \frac{1}{n} X_S^T W - \lambda_n b \Big], \tag{10a}$$

$$V_j := X_j^T \Big\{ X_S \big(X_S^T X_S\big)^{-1} \lambda_n b - \Big[ X_S \big(X_S^T X_S\big)^{-1} X_S^T - I_{n \times n} \Big] \frac{W}{n} \Big\}. \tag{10b}$$

Recall that $s = |S|$ and $N = |S^c| = p - s$. From Lemma 1, the behavior of $\mathcal{R}(X, \beta^*, W, \lambda_n)$ is determined by the behavior of $\max_{j \in S^c} |V_j|$ and $\max_{i \in S} |U_i|$. In particular, condition (9a) holds if and only if the event

$$\mathcal{M}(V) := \Big\{ \max_{j \in S^c} |V_j| \le \lambda_n \Big\} \tag{11}$$

holds. On the other hand, if we define $\rho_n := \min_{i \in S} |\beta_i^*|$, then the event

$$\mathcal{M}(U) := \Big\{ \max_{i \in S} |U_i| < \rho_n \Big\} \tag{12}$$

is sufficient to guarantee that condition (9b) holds. Consequently, our proofs are based on analyzing the asymptotic probability of these two events.
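For a given design, noise vector, and regularization level, the two events can be evaluated directly. The sketch below is a hedged illustration with hypothetical function names; it assumes the reading $U = (\frac{1}{n}X_S^TX_S)^{-1}[\frac{1}{n}X_S^TW - \lambda_n b]$ for the vector of $U_i$, and uses the algebraic rearrangement $V_j = \frac{1}{n}X_j^T(W - X_S U)$ of the definition of $V_j$, which follows by factoring $X_S(X_S^TX_S)^{-1}$ out of both terms. The small Gaussian-elimination solver is included only to keep the example free of external dependencies.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    m = len(A)
    M = [list(A[i]) + [rhs[i]] for i in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * m
    for i in reversed(range(m)):
        x[i] = (M[i][m] - sum(M[i][k] * x[k] for k in range(i + 1, m))) / M[i][i]
    return x

def lemma1_quantities(X, w, beta_star, lam):
    """Return (S, U, V): the support, the U_i for i in S, and the V_j for j in S^c."""
    n, p = len(X), len(X[0])
    S = [j for j in range(p) if beta_star[j] != 0]
    Sc = [j for j in range(p) if beta_star[j] == 0]
    b = [1.0 if beta_star[j] > 0 else -1.0 for j in S]
    # G = (1/n) X_S^T X_S, assumed invertible as in Lemma 1
    G = [[sum(X[k][i] * X[k][j] for k in range(n)) / n for j in S] for i in S]
    rhs = [sum(X[k][i] * w[k] for k in range(n)) / n - lam * bi for i, bi in zip(S, b)]
    U = solve(G, rhs)
    # V_j = (1/n) X_j^T (w - X_S U)
    resid = [w[k] - sum(X[k][i] * u for i, u in zip(S, U)) for k in range(n)]
    V = [sum(X[k][j] * resid[k] for k in range(n)) / n for j in Sc]
    return S, U, V

def events_hold(X, w, beta_star, lam):
    """Return (M(V) holds, M(U) holds) for the given problem instance."""
    S, U, V = lemma1_quantities(X, w, beta_star, lam)
    rho = min(abs(beta_star[j]) for j in S)
    return all(abs(v) <= lam for v in V), all(abs(u) < rho for u in U)
```

When both flags are true, the conditions of Lemma 1 are satisfied for that instance, so the Lasso has a solution with the correct signed support.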



2.2 Recovery of sparsity: deterministic design 

We now show how Lemma 1 can be used to analyze the behavior of the Lasso for the special case of a deterministic (non-random) design matrix $X$. To gain intuition for the conditions in the theorem statement, it is helpful to consider the zero-noise condition $w = 0$, in which each observation $Y_k = x_k^T \beta^*$ is uncorrupted. In this case, the conditions of Lemma 1 reduce to

$$\Big| X_{S^c}^T X_S \big(X_S^T X_S\big)^{-1} \mathrm{sgn}(\beta_S^*) \Big| \le 1, \tag{13a}$$

$$\mathrm{sgn}(\beta_i^*) \Big[ \beta_S^* - \lambda_n \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} \mathrm{sgn}(\beta_S^*) \Big]_i > 0 \quad \text{for all } i \in S. \tag{13b}$$

Of course, if the conditions of Lemma 1 fail to hold in the zero-noise setting, then there is little hope of succeeding in the presence of noise.

The zero-noise conditions motivate imposing the following set of conditions on the design matrix:

$$\Big\| X_{S^c}^T X_S \big(X_S^T X_S\big)^{-1} \Big\|_\infty \le (1 - \epsilon) \quad \text{for some } \epsilon \in (0, 1], \text{ and} \tag{14a}$$

$$\Lambda_{\min}\Big(\frac{1}{n} X_S^T X_S\Big) \ge C_{\min} > 0, \tag{14b}$$

where $\Lambda_{\min}$ denotes the minimal eigenvalue. Under these conditions, we have the following:
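Proposition 1. Suppose that we observe $Y = X\beta^* + W$, where each column $X_j$ of $X$ is normalized so that $\|X_j\|_2^2 = n$, and $W \sim N(0, \sigma^2 I)$. Assume $\beta^*$ and $X$ satisfy conditions (14), and define $\rho_n := \min_{i \in S} |\beta_i^*|$. If $\lambda_n \to 0$ is chosen such that

$$\text{(a)} \;\; \frac{n \lambda_n^2}{\log(p - s)} \to +\infty, \quad \text{and} \quad \text{(b)} \;\; \frac{1}{\rho_n} \Big\{ \sqrt{\frac{\log s}{n}} + \lambda_n \Big\| \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} \Big\|_\infty \Big\} \to 0, \tag{15}$$

then $\mathbb{P}[\mathcal{R}(X, \beta^*, W, \lambda_n)] \to 1$ as $n \to +\infty$.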

Before proving the proposition, we pause to make a number of comments. First, conditions of the form (14a) have been considered in previous work on the Lasso [17, 18, 25, 34, 35]. In particular, various authors [34, 28, 35] provide examples and results on matrix families that satisfy this type of condition. Moreover, previous work [28, 35] provides asymptotic results for particular scalings of $p$, $s$ and $n$ for random design matrices, as we discuss in more detail in Section 3.2. To the best of our knowledge, Proposition 1 is the first result to provide sufficient conditions for exact recovery in deterministic designs with general scaling of $p$, $s$ and $n$.

Second, it is worthwhile to consider Proposition 1 in the classical setting (i.e., in which the number of samples $n \to +\infty$ with $p$ and $s$ remaining fixed). In this setting, the quantity $\rho_n = \min_{i \in S} |\beta_i^*|$ does not depend on $n$. Hence, in addition to the condition (14), the requirements reduce to $\lambda_n \to 0$ and $n \lambda_n^2 \to +\infty$, so that any sequence $\lambda_n$ decaying to zero more slowly than $n^{-1/2}$ is a suitable choice. This classical case is also covered by previous work.

Last, consider the more general setting where all three parameters $(n, p, s)$ grow to infinity, and suppose for simplicity that $\rho_n$ stays bounded away from 0. The conditions $\lambda_n^2 \to 0$ and $\frac{n \lambda_n^2}{\log(p - s)} \to +\infty$ imply that the number of observations $n$ must grow at a rate faster than $\log(p - s)$. In the following section, in which we consider the more general case of random Gaussian ensembles, we will see that for ensembles satisfying mutual incoherence conditions, we in fact require that $\frac{n}{\log(p - s)} = \Theta(s) \to +\infty$.



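2.3 Proof of Proposition 1

Recall the events $\mathcal{M}(V)$ and $\mathcal{M}(U)$ defined in equations (11) and (12) respectively. To establish the claim, we must show that $\mathbb{P}[\mathcal{M}(V)^c \text{ or } \mathcal{M}(U)^c] \to 0$, where $\mathcal{M}(V)^c$ and $\mathcal{M}(U)^c$ denote the complements of these events. By the union bound, it suffices to show that both $\mathbb{P}[\mathcal{M}(V)^c]$ and $\mathbb{P}[\mathcal{M}(U)^c]$ converge to zero, or equivalently that $\mathbb{P}[\mathcal{M}(V)]$ and $\mathbb{P}[\mathcal{M}(U)]$ both converge to one.

Analysis of $\mathcal{M}(V)$: We begin by establishing that $\mathbb{P}[\mathcal{M}(V)] \to 1$. Throughout the proof, we use the shorthand $b := \mathrm{sgn}(\beta_S^*)$ and $N := p - s = |S^c|$.

Recalling the definition (10b) of the random variables $V_j$, note that $\mathcal{M}(V)$ holds if and only if $\min_{j \in S^c} V_j / \lambda_n \ge -1$ and $\max_{j \in S^c} V_j / \lambda_n \le 1$. Moreover, we note that each $V_j$ is Gaussian with mean

$$\mu_j := \mathbb{E}[V_j] = \lambda_n X_j^T X_S \big(X_S^T X_S\big)^{-1} b.$$

Using condition (14a), we have $|\mu_j| \le (1 - \epsilon) \lambda_n$ for all indices $j = 1, \ldots, N$, from which we obtain that

$$\frac{\max_{j \in S^c} V_j}{\lambda_n} \le (1 - \epsilon) + \frac{1}{\lambda_n} \max_{j \in S^c} \tilde{V}_j, \quad \text{and} \quad \frac{\min_{j \in S^c} V_j}{\lambda_n} \ge -(1 - \epsilon) + \frac{1}{\lambda_n} \min_{j \in S^c} \tilde{V}_j,$$

where the variables $\tilde{V}_j := X_j^T \big[ I_{n \times n} - X_S (X_S^T X_S)^{-1} X_S^T \big] \frac{W}{n}$ are zero-mean (correlated) Gaussian variables.

Hence, in order to establish condition (9a) of Lemma 1, we need to show that

$$\mathbb{P}\Big[ \frac{1}{\lambda_n} \min_{j \in S^c} \tilde{V}_j < -\epsilon, \;\; \text{or} \;\; \frac{1}{\lambda_n} \max_{j \in S^c} \tilde{V}_j > \epsilon \Big] \to 0. \tag{16}$$

In fact, using a lemma on Gaussian maxima (see Appendix B), it is sufficient to show that $\mathbb{P}\big[ \frac{1}{\lambda_n} \max_{j \in S^c} \tilde{V}_j > \epsilon \big] \to 0$. By applying Markov's inequality and Gaussian comparison results [25] (see the corresponding lemma in Appendix B), we obtain

$$\mathbb{P}\Big[ \frac{\max_{j \in S^c} |\tilde{V}_j|}{\lambda_n} > \epsilon \Big] \le \frac{\mathbb{E}\big[ \max_{j \in S^c} |\tilde{V}_j| \big]}{\epsilon \lambda_n} \le \frac{3 \sqrt{\log N}}{\epsilon \lambda_n} \max_{j} \sqrt{\mathbb{E}[\tilde{V}_j^2]}.$$

Straightforward computation yields that

$$\mathbb{E}[\tilde{V}_j^2] = \frac{\sigma^2}{n^2} X_j^T \Big[ I_{n \times n} - X_S \big(X_S^T X_S\big)^{-1} X_S^T \Big] X_j \le \frac{\sigma^2}{n^2} \|X_j\|_2^2 = \frac{\sigma^2}{n},$$

since the matrix $I_{n \times n} - X_S (X_S^T X_S)^{-1} X_S^T$ has maximum eigenvalue equal to one, and $\|X_j\|_2^2 = n$ by construction. Consequently, condition (a) in the theorem statement (namely, that $\frac{n \lambda_n^2}{\log(p - s)} \to +\infty$) is sufficient to ensure that $\mathbb{E}[\max_{j \in S^c} |\tilde{V}_j|] / \lambda_n \to 0$. Thus, we have established $\mathbb{P}(\mathcal{M}(V)) \to 1$ (i.e., that condition (9a) holds with probability converging to one as $n \to +\infty$).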
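The Gaussian-maxima bound used in this step, $\mathbb{E}[\max_j |Z_j|] \le 3\sqrt{\log N}\, \max_j \sqrt{\mathbb{E}[Z_j^2]}$, can be sanity-checked numerically. The sketch below is a minimal Monte Carlo illustration under simplifying assumptions that are not part of the proof: the variables are drawn independently with a common standard deviation (the proof handles correlated variables), and the function names, sample sizes, and seed are hypothetical.

```python
import math
import random

def mean_max_abs_gaussian(N, sigma, trials, seed=0):
    """Monte Carlo estimate of E[max_j |Z_j|] for N i.i.d. N(0, sigma^2) variables."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(abs(rng.gauss(0.0, sigma)) for _ in range(N))
    return total / trials

def comparison_bound(N, sigma):
    """The upper bound 3 * sqrt(log N) * sigma, meaningful for N >= 2."""
    return 3.0 * math.sqrt(math.log(N)) * sigma

if __name__ == "__main__":
    for N in (10, 100, 1000):
        print(N, mean_max_abs_gaussian(N, 1.0, 200), comparison_bound(N, 1.0))
```

Dividing the bound by $\lambda_n$ and substituting $\sigma/\sqrt{n}$ for the standard deviation shows why the ratio vanishes exactly when $n\lambda_n^2/\log N \to +\infty$.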



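Analysis of $\mathcal{M}(U)$: We now show that $\mathbb{P}(\mathcal{M}(U)) \to 1$. Beginning with the triangle inequality, we upper bound $\max_{i \in S} |U_i| = \big\| (\frac{1}{n} X_S^T X_S)^{-1} [\frac{1}{n} X_S^T W - \lambda_n\, \mathrm{sgn}(\beta_S^*)] \big\|_\infty$ as

$$\max_{i \in S} |U_i| \le \Big\| \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} \frac{1}{n} X_S^T W \Big\|_\infty + \Big\| \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} \Big\|_\infty \lambda_n.$$

Let $e_i$ denote the unit vector with one in position $i$ and zeroes elsewhere. Now define, for each index $i \in S$, the Gaussian random variable $Z_i := e_i^T (\frac{1}{n} X_S^T X_S)^{-1} \frac{1}{n} X_S^T W$. Each such $Z_i$ is a zero-mean Gaussian with variance given by

$$\mathrm{var}(Z_i) = \frac{\sigma^2}{n}\, e_i^T \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} e_i \le \frac{\sigma^2}{n\, C_{\min}}.$$

Hence, by a standard Gaussian comparison theorem [23] (in particular, see the corresponding lemma in Appendix B), we have

$$\mathbb{E}\Big[ \max_{1 \le i \le s} |Z_i| \Big] = \mathbb{E}\, \Big\| \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} \frac{1}{n} X_S^T W \Big\|_\infty \le 3 \sqrt{\frac{\sigma^2 \log s}{n\, C_{\min}}}.$$

Thus, recalling the definition $\rho_n := \min_{i \in S} |\beta_i^*|$, we apply Markov's inequality to conclude that

$$\begin{aligned}
\mathbb{P}[\mathcal{M}(U)^c] &= \mathbb{P}\Big[ \frac{1}{\rho_n} \max_{1 \le i \le s} |U_i| \ge 1 \Big] \\
&\le \mathbb{P}\Big[ \frac{1}{\rho_n} \Big\{ \max_{1 \le i \le s} |Z_i| + \lambda_n \Big\| \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} \Big\|_\infty \Big\} \ge 1 \Big] \\
&\le \frac{1}{\rho_n} \Big\{ \mathbb{E}\Big[ \max_{1 \le i \le s} |Z_i| \Big] + \lambda_n \Big\| \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} \Big\|_\infty \Big\} \\
&\le \frac{1}{\rho_n} \Big\{ 3 \sqrt{\frac{\sigma^2 \log s}{n\, C_{\min}}} + \lambda_n \Big\| \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} \Big\|_\infty \Big\},
\end{aligned}$$

which converges to zero as $n \to +\infty$, using condition (b) in the theorem statement.

□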



3 Recovery of sparsity: random Gaussian ensembles 

We now turn to the analysis of random design matrices $X$, in which each row $x_k$ is chosen as an i.i.d. Gaussian random vector with covariance matrix $\Sigma$. In particular, we prove the existence of thresholds that provide a sharp description of the failure/success of the Lasso as a function of $(n, p, s)$. We begin by setting up and providing a precise statement of the main result, and then discussing its connections to previous work. In the later part of this section, we provide the proof.

3.1 Statement of main result

Consider a covariance matrix $\Sigma$ with unit diagonal, and with its minimum and maximum eigenvalues (denoted $\Lambda_{\min}$ and $\Lambda_{\max}$ respectively) bounded as

$$\Lambda_{\min}(\Sigma_{SS}) \ge C_{\min}, \quad \text{and} \quad \Lambda_{\max}(\Sigma) \le C_{\max} \tag{17}$$

for constants $C_{\min} > 0$ and $C_{\max} < +\infty$. Given a vector $\beta^* \in \mathbb{R}^p$, define its support $S = \{ i \in \{1, \ldots, p\} \mid \beta_i^* \neq 0 \}$, as well as the complement $S^c$ of its support. Suppose that $\Sigma$ and $S$ satisfy the conditions $\|(\Sigma_{SS})^{-1}\|_\infty \le D_{\max}$ for some $D_{\max} < +\infty$, and

$$\big\| \Sigma_{S^c S} (\Sigma_{SS})^{-1} \big\|_\infty \le (1 - \epsilon) \tag{18}$$

for some $\epsilon \in (0, 1]$. Under these conditions, we consider the observation model

$$Y_k = x_k^T \beta^* + W_k, \qquad k = 1, \ldots, n, \tag{19}$$

where $x_k \sim N(0, \Sigma)$ and $W_k \sim N(0, \sigma^2)$ are independent Gaussian variables for $k = 1, \ldots, n$. Furthermore, we define $\rho_n := \min_{i \in S} |\beta_i^*|$ and the sparsity index $s = |S|$.

Theorem 1. Consider a sequence of covariance matrices $\{\Sigma[p]\}$ and solution vectors $\{\beta^*[p]\}$ satisfying conditions (17) and (18). Under the observation model (19), consider a sequence $(n, p(n), s(n))$ such that $s$, $(n - s)$ and $(p - s)$ tend to infinity. Define the thresholds

$$\theta_\ell := \frac{\Big( \sqrt{C_{\max}} - \sqrt{C_{\max} - \frac{1}{C_{\max}}} \Big)^2}{C_{\max} (2 - \epsilon)^2} \le 1, \quad \text{and} \quad \theta_u := \frac{C_{\max}}{\epsilon^2 C_{\min}} \ge 1. \tag{20}$$

Then for any constant $\nu > 0$, we have the following:

(a) If $n < 2(\theta_\ell - \nu)\, s \log(p - s) + s + 1$, then $\mathbb{P}[\mathcal{R}(X, \beta^*, W, \lambda_n)] \to 0$ for any non-increasing sequence $\lambda_n > 0$.

(b) Conversely, if $n > 2(\theta_u + \nu)\, s \log(p - s) + s + 1$, and $\lambda_n \to 0$ is chosen such that

$$\frac{n \lambda_n^2}{\log(p - s)} \to +\infty, \quad \text{and} \quad \frac{1}{\rho_n} \Big\{ \lambda_n + \sqrt{\frac{\log s}{n}} \Big\} \to 0, \tag{21}$$

then $\mathbb{P}[\mathcal{R}(X, \beta^*, W, \lambda_n)] \to 1$.



Remark: Suppose for simplicity that $\rho_n$ remains bounded away from 0. In this case, the requirements on $\lambda_n$ reduce to $\lambda_n \to 0$, and $\lambda_n^2 n / \log(p - s) \to +\infty$. One suitable choice is $\lambda_n^2 = \frac{\log(s) \log(p - s)}{n}$, with which we have

$$\lambda_n^2 = \Big( \frac{s \log(p - s)}{n} \Big) \frac{\log(s)}{s} \to 0, \quad \text{and} \quad \frac{n \lambda_n^2}{\log(p - s)} = \log(s) \to +\infty.$$

Without a bound on $\rho_n$, the second condition in equation (21) constrains the rate of decrease of the minimum $\rho_n = \min_{i \in S} |\beta_i^*|$.
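To see how the threshold constants translate into sample sizes, the sketch below evaluates $\theta_\ell$, $\theta_u$, and the critical sample size $2\theta\, s\log(p-s) + s + 1$ for given problem parameters. It is an illustration under stated assumptions: the function names are hypothetical, the formulas are the ones read from the threshold definitions $\theta_\ell = (\sqrt{C_{\max}} - \sqrt{C_{\max} - 1/C_{\max}})^2 / (C_{\max}(2-\epsilon)^2)$ and $\theta_u = C_{\max}/(\epsilon^2 C_{\min})$, and $C_{\max} \ge 1$ is assumed (which holds for a covariance with unit diagonal).

```python
import math

def theta_lower(c_max, eps):
    """Lower threshold constant; assumes c_max >= 1 and 0 < eps <= 1."""
    num = (math.sqrt(c_max) - math.sqrt(c_max - 1.0 / c_max)) ** 2
    return num / (c_max * (2.0 - eps) ** 2)

def theta_upper(c_min, c_max, eps):
    """Upper threshold constant; assumes c_min > 0 and 0 < eps <= 1."""
    return c_max / (eps ** 2 * c_min)

def sample_size_threshold(theta, s, p):
    """Critical sample size 2 * theta * s * log(p - s) + s + 1 (requires p - s >= 2)."""
    return 2.0 * theta * s * math.log(p - s) + s + 1

if __name__ == "__main__":
    # Hypothetical example values for the eigenvalue bounds and incoherence parameter.
    c_min, c_max, eps, s, p = 0.5, 2.0, 0.5, 10, 1010
    print(sample_size_threshold(theta_lower(c_max, eps), s, p))
    print(sample_size_threshold(theta_upper(c_min, c_max, eps), s, p))
```

For sample sizes between the two printed values the theorem makes no claim; the gap closes when $\theta_\ell = \theta_u$.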

3.2 Some consequences 

To develop intuition for this result, we begin by stating certain special cases as corollaries, and 
discussing connections to previous work. 

3.2.1 Uniform Gaussian ensembles 

First, we consider the special case of the uniform Gaussian ensemble, in which $\Sigma = I_{p \times p}$. Previous work by Donoho as well as Candes and Tao has focused on the uniform Gaussian ensemble in the noiseless ($\sigma^2 = 0$) and underdetermined setting ($n = \gamma p$ for some $\gamma \in (0, 1)$). Analyzing the asymptotic behavior of the linear program (4) for recovering $\beta^*$, the basic result is that there exists some $\alpha > 0$ such that all sparsity patterns with $s \le \alpha p$ can be recovered with high probability.

Applying Theorem 1 to the noisy version of this problem, the uniform Gaussian ensemble means that we can choose $\epsilon = 1$, and $C_{\min} = C_{\max} = 1$, so that the threshold constants reduce to

$$\theta_\ell = \frac{\Big( \sqrt{C_{\max}} - \sqrt{C_{\max} - \frac{1}{C_{\max}}} \Big)^2}{C_{\max} (2 - \epsilon)^2} = 1, \quad \text{and} \quad \theta_u = \frac{C_{\max}}{\epsilon^2 C_{\min}} = 1.$$

Consequently, Theorem 1 provides a sharp threshold for the behavior of the Lasso, in that failure/success is entirely determined by whether or not $n > 2 s \log(p - s) + s + 1$. Thus, if we consider the particular linear scaling analyzed in previous work on the noiseless case, we have:

Corollary 1 (Linearly underdetermined setting). Suppose that $n = \gamma p$ for some $\gamma \in (0, 1)$. Then

(a) If $s = \alpha p$ for any $\alpha \in (0, 1)$, then $\mathbb{P}[\mathcal{R}(X, \beta^*, W, \lambda_n)] \to 0$ for any positive sequence $\lambda_n > 0$.

(b) On the other hand, if $s = O(\frac{p}{\log p})$, then $\mathbb{P}[\mathcal{R}(X, \beta^*, W, \lambda_n)] \to 1$ for any sequence $\{\lambda_n\}$ satisfying the conditions of Theorem 1(b).

Conversely, suppose that the size $s$ of the support of $\beta^*$ scales linearly with the number of parameters $p$. The following result describes the amount of data required for the $\ell_1$-constrained QP to recover the sparsity pattern in the noisy setting ($\sigma^2 > 0$):

Corollary 2 (Linear fraction support). Suppose that $s = \alpha p$ for some $\alpha \in (0, 1)$. Then we require $n > 2 \alpha p \log[(1 - \alpha) p] + \alpha p$ in order to obtain exact recovery with probability converging to one for large problems.

These two corollaries establish that there is a significant difference between recovery using basis pursuit (4) in the noiseless setting versus recovery using the Lasso (3) in the noisy setting. When the amount of data $n$ scales only linearly with ambient dimension $p$, then the presence of noise means that the recoverable support size drops from a linear fraction (i.e., $s = \alpha p$ as in the noiseless work) to a sublinear fraction (i.e., $s = O(\frac{p}{\log p})$, as in Corollary 1).
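The drop from a linear to a sublinear recoverable fraction can be seen numerically. The sketch below, for the uniform Gaussian ensemble with threshold constant 1, scans for the largest support size $s$ that still satisfies $n > 2s\log(p - s) + s + 1$, and also evaluates the sample size $2\alpha p \log[(1-\alpha)p] + \alpha p$ from Corollary 2. The function names and the example values of $n$ and $p$ are hypothetical.

```python
import math

def max_recoverable_s(n, p):
    """Largest s such that n > 2*s'*log(p - s') + s' + 1 for every 1 <= s' <= s.

    Returns 0 when even s = 1 violates the condition.
    """
    s = 0
    while s + 1 <= p - 2 and n > 2 * (s + 1) * math.log(p - s - 1) + (s + 1) + 1:
        s += 1
    return s

def corollary2_sample_size(alpha, p):
    """Sample size 2*alpha*p*log((1 - alpha)*p) + alpha*p for support size s = alpha*p."""
    return 2.0 * alpha * p * math.log((1.0 - alpha) * p) + alpha * p

if __name__ == "__main__":
    # With n = p / 2, the recoverable fraction s / p shrinks as p grows.
    for p in (1000, 10000, 100000):
        s = max_recoverable_s(p // 2, p)
        print(p, s, s / p)
```

Because the critical sample size grows like $s \log p$, fixing the ratio $n/p$ forces $s/p$ to decay roughly like $1/\log p$, which is the sublinear scaling in Corollary 1(b).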



3.2.2 Non-uniform Gaussian ensembles 

We now consider more general (non-uniform) Gaussian ensembles that satisfy conditions (17) and (18). As mentioned earlier, previous papers by both Meinshausen and Buhlmann [28] as well as Zhao and Yu [35] treat model selection with the high-dimensional Lasso. For suitable covariance matrices (e.g., satisfying conditions (17) and (18)), both sets of authors proved that the sparsity pattern can be recovered exactly under scaling conditions of the form

$$s = O(n^{c_1}), \quad \text{and} \quad p = O(e^{n^{c_2}}), \quad \text{where } c_1 + c_2 < 1. \tag{22}$$

Applying Theorem 1 in this scenario, we have the following:

Corollary 3. Under the scaling (22), the Lasso will recover the sparsity pattern with probability converging to one.

Proof. Substituting the conditions (22) into the threshold condition (7), we obtain that the RHS takes the form

$$2 s \log(p - s) + s + 1 = O(n^{c_1}) \log\big[ O(e^{n^{c_2}}) - O(n^{c_1}) \big] + O(n^{c_1}) = O(n^{c_1 + c_2}) \ll n,$$

since $c_1 + c_2 < 1$ by assumption. Thus, we see that under these conditions, our threshold condition (7) is satisfied a fortiori. □
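The order computation in this proof can be spelled out in one more step. The following is a sketch of the arithmetic, where the constants $A$ and $B$ are hypothetical names for the constants hidden in the $O(\cdot)$ notation of the scaling assumption:

```latex
% Assume s <= A n^{c_1} and p <= B e^{n^{c_2}} for constants A, B > 0 (hypothetical names).
\begin{align*}
\log(p - s) &\le \log p \le n^{c_2} + \log B, \\
2 s \log(p - s) + s + 1 &\le 2 A n^{c_1} \big( n^{c_2} + \log B \big) + A n^{c_1} + 1
   = O\big(n^{c_1 + c_2}\big), \\
\frac{n^{c_1 + c_2}}{n} &= n^{-(1 - c_1 - c_2)} \to 0 \quad \text{since } c_1 + c_2 < 1 .
\end{align*}
```

Multiplying the right-hand side by any fixed constant such as $2(\theta_u + \nu)$ does not change the order, so the sample-size condition for success holds for all sufficiently large $n$.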

In fact, under this stronger scaling (22), both papers [28, 35] proved that the probability of exact recovery converges to one at a rate exponential in some polynomial function of $n$. Interestingly, our results show that the Lasso can recover the sparsity pattern for a much broader range of $(n, p, s)$ scaling.

3.3 Proof of Theorem 1(b)

We now turn to the proof of part (b) of our main result. As with the proof of Proposition 1, the proof is based on analyzing the collections of random variables $\{V_j \mid j \in S^c\}$ and $\{U_i \mid i \in S\}$, as defined in equations (10b) and (10a) respectively. We begin with some preliminary results that serve to set up the argument.

3.3.1 Some preliminary results 

We first note that for $s < n$, the random Gaussian matrix $X_S$ will have rank $s$ with probability one, whence the matrix $X_S^T X_S$ is invertible with probability one. Accordingly, the necessary and sufficient conditions of Lemma 1 are applicable. Our first lemma, proved in Appendix D.1, concerns the behavior of the random vector $V = (V_1, \ldots, V_N)$, when conditioned on $X_S$ and $W$. Recalling the shorthand notation $b := \mathrm{sgn}(\beta_S^*)$, we summarize in the following

Lemma 2. Conditioned on $X_S$ and $W$, the random vector $(V \mid W, X_S)$ is Gaussian. Its mean vector is upper bounded as

$$\big| \mathbb{E}[V \mid W, X_S] \big| \le \lambda_n (1 - \epsilon)\, \mathbf{1}. \tag{23}$$

Moreover, its conditional covariance takes the form

$$\mathrm{cov}[V \mid W, X_S] = M_n\, \Sigma_{(S^c \mid S)} = M_n \big[ \Sigma_{S^c S^c} - \Sigma_{S^c S} (\Sigma_{SS})^{-1} \Sigma_{S S^c} \big], \tag{24}$$

where

$$M_n := \lambda_n^2\, b^T \big(X_S^T X_S\big)^{-1} b + \frac{1}{n^2} W^T \Big[ I_{n \times n} - X_S \big(X_S^T X_S\big)^{-1} X_S^T \Big] W \tag{25}$$

is a random scaling factor.

The following lemma, proved in Appendix D.2, captures the behavior of the random scaling factor $M_n$ defined in equation (25):

Lemma 3. The random variable $M_n$ has mean

$$\mathbb{E}[M_n] = \frac{\lambda_n^2}{n - s - 1}\, b^T (\Sigma_{SS})^{-1} b + \frac{\sigma^2 (n - s)}{n^2}. \tag{26}$$

Moreover, it is sharply concentrated in that for any $\delta > 0$, we have

$$\mathbb{P}\big[ |M_n - \mathbb{E}[M_n]| \ge \delta\, \mathbb{E}[M_n] \big] \to 0 \quad \text{as } n \to +\infty. \tag{27}$$
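The expression for the mean of $M_n$ can be traced to two standard facts, sketched here as a plausibility check rather than as a substitute for the appendix proof: the mean of an inverse Wishart matrix, and the chi-squared distribution of a Gaussian quadratic form in a projection matrix. The symbol $P_S$ is shorthand introduced only for this sketch.

```latex
% P_S := X_S (X_S^T X_S)^{-1} X_S^T is the orthogonal projection onto the column space of X_S.
\begin{align*}
\mathbb{E}\big[(X_S^T X_S)^{-1}\big] &= \frac{(\Sigma_{SS})^{-1}}{n - s - 1}
   && \text{(inverse Wishart mean, valid for } n > s + 1\text{)}, \\
\mathbb{E}\big[ W^T (I_{n \times n} - P_S) W \,\big|\, X_S \big] &= \sigma^2 \,\mathrm{tr}(I_{n \times n} - P_S) = \sigma^2 (n - s)
   && \text{(rank of } I_{n \times n} - P_S \text{ is } n - s \text{ a.s.)}, \\
\mathbb{E}[M_n] &= \frac{\lambda_n^2}{n - s - 1}\, b^T (\Sigma_{SS})^{-1} b + \frac{\sigma^2 (n - s)}{n^2}.
\end{align*}
```

Both terms are of order $1/n$ up to the factors $\lambda_n^2\, b^T(\Sigma_{SS})^{-1} b$ and $\sigma^2$, which is what makes the comparison of $\lambda_n^2$ against the variance scale $v^*$ in the main argument informative.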

3.3.2 Main argument 

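With these preliminary results in hand, we now turn to analysis of the collections of random variables $\{U_i, i \in S\}$ and $\{V_j, j \in S^c\}$.

Analysis of $\mathcal{M}(V)$: We begin by analyzing the behavior of $\max_{j \in S^c} |V_j|$. First, for a fixed but arbitrary $\delta > 0$, define the event $\mathcal{T}(\delta) := \{ |M_n - \mathbb{E}[M_n]| \ge \delta\, \mathbb{E}[M_n] \}$. By conditioning on $\mathcal{T}(\delta)$ and its complement $[\mathcal{T}(\delta)]^c$, we have the upper bound

$$\mathbb{P}\Big[ \max_{j \in S^c} |V_j| > \lambda_n \Big] \le \mathbb{P}\Big[ \max_{j \in S^c} |V_j| > \lambda_n \,\Big|\, [\mathcal{T}(\delta)]^c \Big] + \mathbb{P}[\mathcal{T}(\delta)].$$

By the concentration statement in Lemma 3, we have $\mathbb{P}[\mathcal{T}(\delta)] \to 0$, so that it suffices to analyze the first term. Set $\mu_j = \mathbb{E}[V_j \mid X_S, W]$, and let $Z$ be a zero-mean Gaussian vector with $\mathrm{cov}(Z) = \mathrm{cov}(V \mid X_S, W)$. Then we have

$$\max_{j \in S^c} |V_j| = \max_{j \in S^c} |\mu_j + Z_j| \le \max_{j \in S^c} \big[ |\mu_j| + |Z_j| \big] \le (1 - \epsilon) \lambda_n + \max_{j \in S^c} |Z_j|,$$

where we have used the upper bound (23) on the mean. This inequality establishes the inclusion of events

$$\Big\{ \max_{j \in S^c} |Z_j| \le \epsilon \lambda_n \Big\} \subseteq \Big\{ \max_{j \in S^c} |V_j| \le \lambda_n \Big\},$$

thereby showing that it suffices to prove that $\mathbb{P}\big[ \max_{j \in S^c} |Z_j| > \epsilon \lambda_n \mid [\mathcal{T}(\delta)]^c \big] \to 0$.

Note that conditioned on $[\mathcal{T}(\delta)]^c$, the maximum value of $M_n$ is $v^* := (1 + \delta)\, \mathbb{E}[M_n]$. Since Gaussian maxima increase with increasing variance, we have

$$\mathbb{P}\Big[ \max_{j \in S^c} |Z_j| > \epsilon \lambda_n \,\Big|\, [\mathcal{T}(\delta)]^c \Big] \le \mathbb{P}\Big[ \max_{j \in S^c} |\tilde{Z}_j| > \epsilon \lambda_n \Big],$$

where $\tilde{Z}$ is zero-mean Gaussian with covariance $v^* \Sigma_{(S^c \mid S)}$.

Using a lemma on Gaussian maxima from the appendix, it suffices to show that $\mathbb{P}[\max_{j \in S^c} \tilde{Z}_j > \epsilon \lambda_n]$ converges to zero. Accordingly, we complete this part of the proof via the following two lemmas, both of which are proved in Appendix D:

Lemma 4. Under the stated assumptions of the theorem, we have $v^* / \lambda_n^2 \to 0$ and

$$\lim_{n \to +\infty} \frac{1}{\lambda_n} \mathbb{E}\Big[ \max_{j \in S^c} \tilde{Z}_j \Big] \le \epsilon.$$

Lemma 5. For any $\eta > 0$, we have

$$\mathbb{P}\Big[ \max_{j \in S^c} \tilde{Z}_j > \eta + \mathbb{E}\big[ \max_{j \in S^c} \tilde{Z}_j \big] \Big] \le 2 \exp\Big( -\frac{\eta^2}{2 v^*} \Big). \tag{28}$$

Lemma 4 implies that for all $\delta > 0$, we have $\mathbb{E}[\max_{j \in S^c} \tilde{Z}_j] \le (1 + \frac{\delta}{2}) \epsilon \lambda_n$ for all $n$ sufficiently large. Therefore, setting $\eta = \frac{\delta}{2} \lambda_n \epsilon$ in the bound (28), we have for fixed $\delta > 0$ and $n$ sufficiently large:

$$\mathbb{P}\Big[ \max_{j \in S^c} \tilde{Z}_j > (1 + \delta) \lambda_n \epsilon \Big] \le \mathbb{P}\Big[ \max_{j \in S^c} \tilde{Z}_j > \frac{\delta}{2} \lambda_n \epsilon + \mathbb{E}\big[ \max_{j \in S^c} \tilde{Z}_j \big] \Big] \le 2 \exp\Big( -\frac{\delta^2 \lambda_n^2 \epsilon^2}{8 v^*} \Big).$$

From Lemma 4, we have $\lambda_n^2 / v^* \to +\infty$, which implies that $\mathbb{P}[\max_{j \in S^c} \tilde{Z}_j > (1 + \delta) \lambda_n \epsilon] \to 0$ for all $\delta > 0$. By the arbitrariness of $\delta > 0$, we thus have $\mathbb{P}[\max_{j \in S^c} \tilde{Z}_j \le \epsilon \lambda_n] \to 1$, thereby establishing that property (9a) of Lemma 1 holds with probability converging to one.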



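Analysis of $\{U_i\}$: Next we prove that $\max_{i \in S} |U_i| < \rho_n := \min_{i \in S} |\beta_i^*|$ with probability converging to one as $n \to +\infty$. Conditioned on $X_S$, the only random component in $U_i$ is the noise vector $W$. A straightforward calculation yields that this conditioned random variable is Gaussian, with mean and variance

$$Y_i := \mathbb{E}[U_i \mid X_S] = -\lambda_n\, e_i^T \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} b, \qquad Y_i' := \mathrm{var}[U_i \mid X_S] = \frac{\sigma^2}{n}\, e_i^T \Big(\frac{1}{n} X_S^T X_S\Big)^{-1} e_i,$$

respectively. The following lemma, proved in Appendix D.5, is key to our proof:

Lemma 6. (a) The random variables $Y_i$ and $Y_i'$ have means

$$\mathbb{E}[Y_i] = -\frac{\lambda_n n}{n - s - 1}\, e_i^T (\Sigma_{SS})^{-1} b, \quad \text{and} \quad \mathbb{E}[Y_i'] = \frac{\sigma^2}{n - s - 1}\, e_i^T (\Sigma_{SS})^{-1} e_i \tag{29}$$

respectively, which are bounded as

$$|\mathbb{E}[Y_i]| \le \frac{D_{\max}\, n \lambda_n}{n - s - 1}, \quad \text{and} \quad \frac{\sigma^2}{C_{\max} (n - s - 1)} \le \mathbb{E}[Y_i'] \le \frac{\sigma^2 D_{\max}}{n - s - 1}. \tag{30}$$

(b) Moreover, each pair $(Y_i, Y_i')$ is sharply concentrated, in that we have

$$\mathbb{P}\Big[ |Y_i| \ge \frac{6 D_{\max}\, n \lambda_n}{n - s - 1} \;\; \text{or} \;\; |Y_i'| \ge 2\, \mathbb{E}[Y_i'] \Big] \le \frac{K}{n - s}, \tag{31}$$

where $K$ is a fixed constant independent of $n$ and $s$.

We exploit this lemma as follows. First define the event

$$\mathcal{T}(\delta) := \bigcup_{i=1}^{s} \Big\{ |Y_i| \ge \frac{6 D_{\max}\, n \lambda_n}{n - s - 1} \;\; \text{or} \;\; |Y_i'| \ge 2\, \mathbb{E}[Y_i'] \Big\}.$$

By the union bound and Lemma 6(b), we have

$$\mathbb{P}[\mathcal{T}(\delta)] \le \frac{s K}{n - s} \to 0,$$

since $\frac{n - s}{s} \to +\infty$ as $n \to +\infty$. For convenience in notation, for any $a \in \mathbb{R}$ and $b \in \mathbb{R}_+$, we use $U(a, b)$ to denote a Gaussian random variable with mean $a$ and variance $b$. Conditioning on the event $\mathcal{T}(\delta)$ and its complement, we have

$$\begin{aligned}
\mathbb{P}\Big[ \max_{i \in S} U_i > \rho_n \Big] &\le \mathbb{P}\Big[ \max_{i \in S} U_i > \rho_n \,\Big|\, \mathcal{T}(\delta)^c \Big] + \mathbb{P}[\mathcal{T}(\delta)] \\
&\le \mathbb{P}\Big[ \max_{i \in S} U_i(\mu^*, v_i^*) > \rho_n \Big] + \frac{s K}{n - s},
\end{aligned} \tag{32}$$

where each $U_i(\mu^*, v_i^*)$ is Gaussian with mean $\mu^* := \frac{6 D_{\max}\, n \lambda_n}{n - s - 1}$ and variance $v_i^* := 2\, \mathbb{E}[Y_i']$ respectively. In asserting the inequality (32), we have used the fact that the probability of the event $\{\max_{i \in S} U_i > \rho_n\}$ increases as the mean $Y_i$ and variance $Y_i'$ increase. Continuing the argument, we have

$$\mathbb{P}\Big[ \max_{i \in S} U_i(\mu^*, v_i^*) > \rho_n \Big] \le \mathbb{P}\Big[ \max_{i \in S} |U_i(\mu^*, v_i^*)| > \rho_n \Big] \le \frac{1}{\rho_n}\, \mathbb{E}\Big[ \max_{i \in S} |U_i(\mu^*, v_i^*)| \Big],$$

where the last step uses Markov's inequality. We now decompose $U_i(\mu^*, v_i^*) = \mu^* + \tilde{U}_i(0, v_i^*)$, and write

$$\mathbb{E}\Big[ \max_{i \in S} |U_i(\mu^*, v_i^*)| \Big] \le \frac{6 D_{\max}\, n \lambda_n}{n - s - 1} + \mathbb{E}\Big[ \max_{i \in S} |\tilde{U}_i(0, v_i^*)| \Big].$$

With this decomposition, we use the bound (30) on $v_i^* := 2\, \mathbb{E}[Y_i']$ and the lemma on Gaussian maxima (see Appendix B) to conclude that

$$\frac{1}{\rho_n}\, \mathbb{E}\Big[ \max_{i \in S} |U_i(\mu^*, v_i^*)| \Big] \le \frac{1}{\rho_n} \Big\{ \frac{6 D_{\max}\, n \lambda_n}{n - s - 1} + 3 \sqrt{\frac{2 \sigma^2 D_{\max} \log s}{n - s - 1}} \Big\},$$

which converges to zero by the second condition (21) in the theorem statement.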



3.4 Proof of Theorem [T]( a) 

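We establish the claim by proving that under the stated conditions, $\max_{j \in S^c} |V_j| > \lambda_n$ with probability converging to one, for any positive sequence $\lambda_n > 0$. We begin by writing $V_j = \mathbb{E}[V_j] + \tilde{V}_j$, where $\tilde{V}_j$ is zero-mean. Now

$$\max_{j \in S^c} |V_j| \;\ge\; \max_{j \in S^c} |\tilde{V}_j| - \max_{j \in S^c} |\mathbb{E}[V_j]| \;\ge\; \max_{j \in S^c} |\tilde{V}_j| - (1-\epsilon)\lambda_n,$$

where we have used Lemma 2. Consequently, the event $\{\max_{j \in S^c} |\tilde{V}_j| > (2-\epsilon)\lambda_n\}$ implies the event $\{\max_{j \in S^c} |V_j| > \lambda_n\}$, so that

$$\mathbb{P}\big[\max_{j \in S^c} |V_j| > \lambda_n\big] \;\ge\; \mathbb{P}\big[\max_{j \in S^c} |\tilde{V}_j| > (2-\epsilon)\lambda_n\big].$$

From the preceding proof of Theorem 1(b), we know that conditioned on $X_S$ and $W$, the random vector $(V_1, \ldots, V_N)$ is Gaussian with covariance of the form $M_n \big[\Sigma_{S^c S^c} - \Sigma_{S^c S} (\Sigma_{SS})^{-1} \Sigma_{S S^c}\big]$; thus, the zero-mean version $(\tilde{V}_1, \ldots, \tilde{V}_N)$ has the same covariance. Moreover, Lemma 3 guarantees that the random scaling term $M_n$ is sharply concentrated. In particular, defining for any $\delta > 0$ the event $T(\delta) := \{ |M_n - \mathbb{E}[M_n]| \ge \delta\, \mathbb{E}[M_n] \}$, we have $\mathbb{P}[T(\delta)] \to 0$, and the bound

$$\mathbb{P}\big[\max_{j \in S^c} |\tilde{V}_j| > (2-\epsilon)\lambda_n\big] \;\ge\; (1 - \mathbb{P}[T(\delta)])\; \mathbb{P}\big[\max_{j \in S^c} |\tilde{V}_j| > (2-\epsilon)\lambda_n \mid T(\delta)^c\big] \;\ge\; (1 - \mathbb{P}[T(\delta)])\; \mathbb{P}\big[\max_{j \in S^c} Z_j > (2-\epsilon)\lambda_n\big],$$

where each $Z_j = Z_j(v^*)$ is the conditioned version of $\tilde{V}_j$ with the scaling factor $M_n$ fixed to $v^* := (1-\delta)\,\mathbb{E}[M_n]$. (Here we have used the fact that the probability of Gaussian maxima decreases as the variance decreases, and that $\mathrm{var}(\tilde{V}_j) \ge v^*$ when conditioned on $T(\delta)^c$.)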

Our proof proceeds by first analyzing the expected value, and then exploiting Gaussian concen- 
tration of measure. We summarize the key results in the following: 

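Lemma 7. Under the stated conditions, one of the following two conditions must hold:

(a) either $\lambda_n^2 / v^* \to +\infty$, and there exists some $\gamma > 0$ such that $\frac{1}{\lambda_n} \mathbb{E}[\max_{j \in S^c} Z_j] \ge (2-\epsilon)[1+\gamma]$ for all sufficiently large $n$, or

(b) there exist constants $\alpha, \gamma > 0$ such that $\lambda_n^2 / v^* \le \alpha$ and $\frac{1}{\lambda_n} \mathbb{E}[\max_{j \in S^c} Z_j] \ge \gamma \sqrt{\log N}$ for all sufficiently large $n$.

Lemma 8. For any $\eta > 0$, we have

$$\mathbb{P}\big[\max_{j \in S^c} Z_j(v^*) < \mathbb{E}[\max_{j \in S^c} Z_j(v^*)] - \eta\big] \;\le\; \exp\Big(-\frac{\eta^2}{2 v^*}\Big). \qquad (33)$$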

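Using these two lemmas, we complete the proof as follows. First, if condition (a) of Lemma 7 holds, then we set $\eta = \frac{(2-\epsilon)\gamma \lambda_n}{2}$ in equation (33) to obtain that

$$\mathbb{P}\Big[\frac{1}{\lambda_n} \max_{j \in S^c} Z_j(v^*) \ge (2-\epsilon)\big(1 + \tfrac{\gamma}{2}\big)\Big] \;\ge\; 1 - \exp\Big(-\frac{(2-\epsilon)^2 \gamma^2 \lambda_n^2}{8 v^*}\Big).$$

This probability converges to 1 since $\lambda_n^2 / v^* \to +\infty$ from Lemma 7(a).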

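On the other hand, if condition (b) holds, then we use the bound $\frac{1}{\lambda_n} \mathbb{E}[\max_{j \in S^c} Z_j] \ge \gamma \sqrt{\log N}$ and set $\eta = \frac{\gamma \lambda_n \sqrt{\log N}}{2}$ in equation (33) to obtain

$$\mathbb{P}\Big[\frac{1}{\lambda_n} \max_{j \in S^c} Z_j(v^*) > 2(2-\epsilon)\Big] \;\ge\; \mathbb{P}\Big[\frac{1}{\lambda_n} \max_{j \in S^c} Z_j(v^*) \ge \frac{\gamma \sqrt{\log N}}{2}\Big] \;\ge\; 1 - \exp\Big(-\frac{\gamma^2 \lambda_n^2 \log N}{8 v^*}\Big).$$

This probability also converges to 1 since $\lambda_n^2 / v^* \ge 1/\alpha$ and $\log N \to +\infty$. Thus, in either case, we have shown that $\lim_{n \to +\infty} \mathbb{P}\big[\frac{1}{\lambda_n} \max_{j \in S^c} Z_j(v^*) > (2-\epsilon)\big] = 1$, thereby completing the proof of Theorem 1(a).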

4 Illustrative simulations 

In this section, we provide some simulations to confirm the threshold behavior predicted by Theorem 1. We consider the following three types of sparsity indices:

(a) linear sparsity, meaning that $s(p) = \alpha p$ for some $\alpha \in (0, 1)$;

(b) sublinear sparsity, meaning that $s(p) = \alpha p / \log(\alpha p)$ for some $\alpha \in (0, 1)$, and

(c) fractional power sparsity, meaning that $s(p) = \alpha p^{\gamma}$ for some $\alpha, \gamma \in (0, 1)$.

For all three types of sparsity indices, we investigate the success/failure of the Lasso in recovering the sparsity pattern, where the number of observations scales as $n = 2\,\theta\, s \log(p - s) + s + 1$. The control parameter $\theta$ is varied in the interval $(0, 2.4)$. For all results shown here, we fixed $\alpha = 0.40$ for all three ensembles, and set $\gamma = 0.75$ for the fractional power ensemble. In addition, we set $\lambda_n = \sqrt{\frac{\log(p-s) \log(s)}{n}}$ in all cases.
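The scalings above can be made concrete with a short sketch. The function names below are hypothetical, and rounding $s$ and $n$ up to integers is an assumption, since the text does not say how non-integer values were handled.

```python
import math


def sparsity_index(p, kind, alpha=0.40, gamma=0.75):
    """Sparsity index s(p) for the three regimes; rounding up is an assumption."""
    if kind == "linear":
        s = alpha * p
    elif kind == "sublinear":
        s = alpha * p / math.log(alpha * p)
    elif kind == "fractional":
        s = alpha * p ** gamma
    else:
        raise ValueError(f"unknown sparsity regime: {kind}")
    return max(1, math.ceil(s))


def sample_size(p, s, theta):
    """Number of observations n = 2 * theta * s * log(p - s) + s + 1, rounded up."""
    return math.ceil(2.0 * theta * s * math.log(p - s) + s + 1)


def regularization(p, s, n):
    """Regularization level lambda_n = sqrt(log(p - s) * log(s) / n)."""
    return math.sqrt(math.log(p - s) * math.log(s) / n)
```

For example, `sample_size(p, sparsity_index(p, "linear"), theta)` gives the sample size used at a given value of the control parameter $\theta$ under these assumptions.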

We begin by considering the uniform Gaussian ensemble, in which each row is chosen in an i.i.d. manner from the multivariate $N(0, I_{p \times p})$ distribution. Recall that for the uniform Gaussian ensemble, the critical value is $\theta_u = \theta_\ell = 1$. Figure 1 plots the control parameter $\theta$ versus the probability of success, for linear sparsity (a), sublinear sparsity (b), and fractional power sparsity (c), for three different problem sizes ($p \in \{128, 256, 512\}$). Each point represents the average of 200 trials. Note how the probability of success rises rapidly from around the predicted threshold point $\theta = 1$, with the sharpness of the threshold increasing for larger problem sizes.

[Figure 1 panels: (a) Identity; Linear. (b) Identity; Sublinear. (c) Identity; Fractional power. Horizontal axis: control parameter $\theta$.]

Figure 1. Plots of the number of data samples (indexed by the control parameter $\theta$) versus the probability of success in the Lasso for the uniform Gaussian ensemble. Each panel shows three curves, corresponding to the problem sizes $p \in \{128, 256, 512\}$, and each point on each curve represents the average of 200 trials. (a) Linear sparsity index: $s(p) = \alpha p$. (b) Sublinear sparsity index: $s(p) = \alpha p / \log(\alpha p)$. (c) Fractional power sparsity index: $s(p) = \alpha p^{\gamma}$ with $\gamma = 0.75$.
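A single trial of such an experiment might look like the following sketch. It is not the authors' code: the coordinate-descent solver, the choice of equal-magnitude non-zero entries (`beta_min`), and all function names are assumptions made for illustration. The objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$ is the one whose optimality condition is equation (35).

```python
import numpy as np


def soft_threshold(x, t):
    """Scalar soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) * ||X_j||^2 for each column
    resid = y.copy()                       # current residual y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            # correlation of column j with the residual that excludes coordinate j
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def recovers_signed_support(p, s, n, lam, sigma, rng, beta_min=1.0):
    """One trial on the uniform Gaussian ensemble; True if signs match exactly."""
    beta_star = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    beta_star[support] = beta_min * rng.choice([-1.0, 1.0], size=s)
    X = rng.standard_normal((n, p))        # rows i.i.d. N(0, I)
    y = X @ beta_star + sigma * rng.standard_normal(n)
    beta_hat = lasso_cd(X, y, lam)
    return bool(np.array_equal(np.sign(beta_hat), np.sign(beta_star)))
```

Averaging the returned indicator over repeated trials at each value of $\theta$ would give one point on a curve of the kind described in Figure 1.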

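We now consider a non-uniform Gaussian ensemble; in particular, one in which the covariance matrices $\Sigma$ are Toeplitz with the structure

$$\Sigma \;=\; \begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{p-2} & \rho^{p-1} \\ \rho & 1 & \rho & \cdots & \rho^{p-3} & \rho^{p-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{p-4} & \rho^{p-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \rho^{p-1} & \rho^{p-2} & \rho^{p-3} & \cdots & \rho & 1 \end{bmatrix}, \qquad (34)$$
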

for some $\rho \in (-1, +1)$. As shown by Zhao and Yu, this family of Toeplitz matrices satisfies the mutual incoherence condition. Moreover, the minimum and maximum eigenvalues ($C_{\min}$ and $C_{\max}$) can be computed using standard asymptotic results on Toeplitz matrix families [20]. Figure 2 shows representative results for this Toeplitz family with $\rho = 0.10$. Panel (a) corresponds to linear sparsity ($s = \alpha p$ with $\alpha = 0.40$), and panel (b) corresponds to sublinear sparsity ($s = \alpha p / \log(\alpha p)$ with $\alpha = 0.40$). Each panel shows three curves, corresponding to the problem sizes $p \in \{128, 256, 512\}$, and each point on each curve represents the average of 200 trials. The vertical lines to the left and right of $\theta = 1$ represent the theoretical upper and lower bounds on the threshold ($\theta_u \approx 1.84$ and $\theta_\ell \approx 0.46$ respectively in this case). Once again, these simulations show good agreement with the theoretical predictions.

[Figure 2 panels: (a) $\rho = 0.10$; Linear. (b) $\rho = 0.10$; Sublinear. (c) $\rho = 0.10$; Fractional power. Horizontal axis: control parameter $\theta$, with ticks at 0.5, 1, 1.5, 2.]

Figure 2. Plots of the number of data samples (indexed by the control parameter $\theta$) versus the probability of success in the Lasso for the Toeplitz family (34) with $\rho = 0.10$. Each panel shows three curves, corresponding to the problem sizes $p \in \{128, 256, 512\}$, and each point on each curve represents the average of 200 trials. (a) Linear sparsity index: $s(p) = \alpha p$. (b) Sublinear sparsity index: $s(p) = \alpha p / \log(\alpha p)$. (c) Fractional power sparsity index: $s(p) = \alpha p^{\gamma}$ with $\gamma = 0.75$.
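As a rough illustration of the Toeplitz family (34), the sketch below builds $\Sigma$ with entries $\rho^{|i-j|}$ and computes its extreme eigenvalues numerically for a finite $p$. The function names are hypothetical, and using finite-$p$ eigenvalues as stand-ins for $C_{\min}$ and $C_{\max}$ is an assumption; the text instead appeals to asymptotic results.

```python
import numpy as np


def toeplitz_covariance(p, rho):
    """Covariance matrix with entries Sigma[i, j] = rho ** |i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def extreme_eigenvalues(sigma):
    """Smallest and largest eigenvalues of a symmetric matrix."""
    eig = np.linalg.eigvalsh(sigma)   # returned in ascending order
    return eig[0], eig[-1]


def sample_rows(n, sigma, rng):
    """Draw n i.i.d. rows from N(0, sigma) via a Cholesky factor."""
    chol = np.linalg.cholesky(sigma)
    return rng.standard_normal((n, sigma.shape[0])) @ chol.T
```

For $\rho \in (0, 1)$ the eigenvalues of every finite section lie between $(1-\rho)/(1+\rho)$ and $(1+\rho)/(1-\rho)$, the extreme values of the generating function $(1-\rho^2)/(1 - 2\rho\cos\omega + \rho^2)$ of this Toeplitz family.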



5 Discussion 



The problem of recovering the sparsity pattern of a high-dimensional vector $\beta^*$ from noisy observations has important applications in signal denoising, graphical model selection, sparse approximation, and subset selection. This paper focuses on the behavior of $\ell_1$-regularized quadratic programming, also known as the Lasso, for estimating such sparsity patterns in the noisy and high-dimensional setting. The main contribution of this paper is to establish a set of general and sharp conditions on the number of observations $n$, the sparsity index $s$ (i.e., the number of non-zero entries in $\beta^*$), and the ambient dimension $p$ that characterize the success/failure behavior of the Lasso in the high-dimensional setting, in which $n$, $p$ and $s$ all tend to infinity. For the uniform Gaussian ensemble, our threshold result is sharp, whereas for more general Gaussian ensembles, it should be possible to tighten the analysis given here.
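A Proof of Lemma 1

By standard conditions for optimality in a convex program, the point $\hat\beta \in \mathbb{R}^p$ is optimal if and only if there exists a subgradient $\hat z \in \partial \ell_1(\hat\beta)$ such that

$$\frac{1}{n} X^T X \hat\beta - \frac{1}{n} X^T y + \lambda \hat z = 0. \qquad (35)$$

Here the subdifferential of the $\ell_1$ norm takes the form

$$\partial \ell_1(\hat\beta) = \{ z \in \mathbb{R}^p \mid z_i = \mathrm{sgn}(\hat\beta_i) \text{ for } \hat\beta_i \neq 0, \; |z_i| \le 1 \text{ otherwise} \}.$$

Substituting our observation model $y = X\beta^* + w$ and re-arranging yields

$$\frac{1}{n} X^T X (\hat\beta - \beta^*) - \frac{1}{n} X^T w + \lambda \hat z = 0. \qquad (36)$$

Now condition $\mathcal{R}(X, \beta^*, w, \lambda)$ holds if and only if we have

$$\hat\beta_{S^c} = 0, \quad \hat\beta_S \neq 0, \quad \text{and} \quad \hat z_S = \mathrm{sgn}(\beta^*_S), \quad |\hat z_{S^c}| \le 1.$$

From these conditions and using equation (36), we conclude that the condition $\mathcal{R}(X, \beta^*, w, \lambda)$ holds if and only if

$$\frac{1}{n} X_{S^c}^T X_S (\hat\beta_S - \beta^*_S) - \frac{1}{n} X_{S^c}^T w = -\lambda \hat z_{S^c},$$

$$\frac{1}{n} X_S^T X_S (\hat\beta_S - \beta^*_S) - \frac{1}{n} X_S^T w = -\lambda\, \mathrm{sgn}(\beta^*_S).$$

Using the invertibility of $X_S^T X_S$, we may solve for $\hat\beta_S$ and $\hat z_{S^c}$ to conclude that

$$-\lambda \hat z_{S^c} = X_{S^c}^T X_S (X_S^T X_S)^{-1} \Big[ \frac{1}{n} X_S^T w - \lambda\, \mathrm{sgn}(\beta^*_S) \Big] - \frac{1}{n} X_{S^c}^T w,$$

$$\hat\beta_S = \beta^*_S + \Big( \frac{1}{n} X_S^T X_S \Big)^{-1} \Big[ \frac{1}{n} X_S^T w - \lambda\, \mathrm{sgn}(\beta^*_S) \Big].$$

From these relations, the conditions $|\hat z_{S^c}| \le 1$ and $\hat\beta_S \neq 0$ yield conditions (9a) and (9b) respectively.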
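The closed-form expressions obtained from the two block equations can be checked numerically. The sketch below is illustrative only (the function name `lasso_witness` is hypothetical): it solves the $S$-block for $\hat\beta_S$, reads $\hat z_{S^c}$ off the $S^c$-block, and returns the full vectors so that the stationarity condition (36) can be verified directly.

```python
import numpy as np


def lasso_witness(X, w, beta_star, lam):
    """Candidate (beta_hat, z_hat) built from the block form of equation (36)."""
    n, p = X.shape
    S = np.flatnonzero(beta_star)
    Sc = np.setdiff1d(np.arange(p), S)
    XS, XSc = X[:, S], X[:, Sc]
    b = np.sign(beta_star[S])

    # S-block: (1/n) XS^T XS delta = (1/n) XS^T w - lam * sgn(beta*_S)
    delta = np.linalg.solve(XS.T @ XS / n, XS.T @ w / n - lam * b)

    # S^c-block: lam * z_Sc = (1/n) XSc^T w - (1/n) XSc^T XS delta
    z_Sc = (XSc.T @ w / n - XSc.T @ XS @ delta / n) / lam

    beta_hat = np.zeros(p)
    beta_hat[S] = beta_star[S] + delta
    z_hat = np.zeros(p)
    z_hat[S] = b
    z_hat[Sc] = z_Sc
    return beta_hat, z_hat
```

Under this construction, the recovery condition amounts to checking that `np.abs(z_hat[Sc]).max()` is at most one and that `np.sign(beta_hat[S])` agrees with `np.sign(beta_star[S])`.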

B Some Gaussian comparison results 

We state here (without proof) some well-known comparison results on Gaussian maxima [25]. We begin with a crude but useful bound:

Lemma 9. For any Gaussian random vector $(X_1, \ldots, X_n)$, we have

$$\mathbb{E}\big[\max_{1 \le i \le n} |X_i|\big] \;\le\; 3 \sqrt{\log n}\; \max_{1 \le i \le n} \sqrt{\mathbb{E}[X_i^2]}.$$

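A quick Monte Carlo sanity check of the bound in Lemma 9 can be written as follows. This is only an illustration for zero-mean vectors (so that $\mathbb{E}[X_i^2]$ equals the variance); the function name is hypothetical and the check proves nothing.

```python
import numpy as np


def gaussian_max_check(cov, n_samples, rng):
    """Return (Monte Carlo estimate of E max_i |X_i|, bound from Lemma 9)."""
    n = cov.shape[0]
    X = rng.multivariate_normal(np.zeros(n), cov, size=n_samples)
    empirical = np.abs(X).max(axis=1).mean()
    # zero-mean case: E[X_i^2] is the i-th diagonal entry of the covariance
    bound = 3.0 * np.sqrt(np.log(n)) * np.sqrt(np.diag(cov).max())
    return empirical, bound
```

For an identity covariance the estimate sits well below the bound, which is consistent with the lemma being described as crude.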

Next we state (a version of) the Sudakov-Fernique inequality [25, 15]:

Lemma 10. Let $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$ be Gaussian random vectors such that for all $i, j$

$$\mathbb{E}[(Y_i - Y_j)^2] \;\le\; \mathbb{E}[(X_i - X_j)^2].$$

Then $\mathbb{E}[\max_{1 \le i \le n} Y_i] \le \mathbb{E}[\max_{1 \le i \le n} X_i]$.
C Auxiliary lemma

For future use, we state formally the following elementary

Lemma 11. Given a collection $\{Z_1, Z_2, \ldots, Z_N\}$ of zero-mean random variables, for any constant $a > 0$ we have

$$\mathbb{P}\big[\max_{1 \le j \le N} |Z_j| \le a\big] \;\le\; \mathbb{P}\big[\max_{1 \le j \le N} Z_j \le a\big], \quad \text{and} \qquad (39a)$$

$$\mathbb{P}\big[\max_{1 \le j \le N} |Z_j| > a\big] \;\le\; 2\, \mathbb{P}\big[\max_{1 \le j \le N} Z_j > a\big]. \qquad (39b)$$

Proof. The first inequality is trivial. To establish the inequality (39b), we write

$$\mathbb{P}\big[\max_{1 \le j \le N} |Z_j| > a\big] = \mathbb{P}\big[(\max_{1 \le j \le N} Z_j > a) \text{ or } (\min_{1 \le j \le N} Z_j < -a)\big] \;\le\; \mathbb{P}\big[\max_{1 \le j \le N} Z_j > a\big] + \mathbb{P}\big[\min_{1 \le j \le N} Z_j < -a\big] = 2\, \mathbb{P}\big[\max_{1 \le j \le N} Z_j > a\big],$$

where we have used the union bound, and the symmetry of the events $\{\max_{1 \le j \le N} Z_j > a\}$ and $\{\min_{1 \le j \le N} Z_j < -a\}$. □
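The two inequalities of Lemma 11 can be illustrated by simulation. The sketch below assumes i.i.d. standard Gaussian $Z_j$, which is one symmetric case in which the final equality in the proof applies; the function name is hypothetical.

```python
import numpy as np


def tail_probabilities(N, a, n_samples, rng):
    """Monte Carlo estimates of P[max |Z_j| > a] and P[max Z_j > a]."""
    Z = rng.standard_normal((n_samples, N))
    p_abs = np.mean(np.abs(Z).max(axis=1) > a)
    p_max = np.mean(Z.max(axis=1) > a)
    return p_abs, p_max
```

The estimate of $\mathbb{P}[\max_j |Z_j| > a]$ should fall between the estimate of $\mathbb{P}[\max_j Z_j > a]$ and twice that value, in line with (39a) and (39b).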

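D Lemmas for Theorem 1
D.1 Proof of Lemma 2

Conditioned on both $X_S$ and $W$, the only random component in $V_j$ is the column vector $X_j$. Using the standard LLSE formula [e.g., [2]] (i.e., for estimating $X_{S^c}$ on the basis of $X_S$), the random variable $(X_{S^c} \mid X_S, W) \sim (X_{S^c} \mid X_S)$ is Gaussian with mean and covariance

$$\mathbb{E}[X_{S^c}^T \mid X_S, W] = \Sigma_{S^c S} (\Sigma_{SS})^{-1} X_S^T, \qquad (40a)$$

$$\mathrm{var}(X_{S^c} \mid X_S) = \Sigma_{(S^c \mid S)} = \Sigma_{S^c S^c} - \Sigma_{S^c S} (\Sigma_{SS})^{-1} \Sigma_{S S^c}. \qquad (40b)$$

Consequently, we have

$$\big| \mathbb{E}[V_j \mid X_S, W] \big| = \Big| \Sigma_{S^c S} (\Sigma_{SS})^{-1} X_S^T \Big\{ X_S (X_S^T X_S)^{-1} \lambda_n b - \big[ X_S (X_S^T X_S)^{-1} X_S^T - I \big] \frac{W}{n} \Big\} \Big| \;\le\; \lambda_n (1-\epsilon)\, \mathbf{1},$$

as claimed.

Similarly, we compute the elements of the conditional covariance matrix as follows:

$$\mathrm{cov}(V_j, V_k \mid X_S, W) = \mathrm{cov}(X_{ij}, X_{ik} \mid X_S, W) \Big\{ \lambda_n^2\, b^T (X_S^T X_S)^{-1} b + \frac{1}{n^2} W^T \big[ I_{n \times n} - X_S (X_S^T X_S)^{-1} X_S^T \big] W \Big\}.$$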
D.2 Proof of Lemma 3

We begin by computing the expected value. Since $X_S^T X_S$ is Wishart with matrix $\Sigma_{SS}$, the random matrix $(X_S^T X_S)^{-1}$ is inverse Wishart with mean $\mathbb{E}[(X_S^T X_S)^{-1}] = \frac{(\Sigma_{SS})^{-1}}{n-s-1}$ (see Lemma 7.7.1 of Anderson [1]). Hence we have

$$\mathbb{E}\big[ \lambda_n^2\, b^T (X_S^T X_S)^{-1} b \big] = \frac{\lambda_n^2}{n-s-1}\, b^T (\Sigma_{SS})^{-1} b. \qquad (41)$$
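The inverse Wishart mean used in (41) is easy to check by simulation. The sketch below takes $\Sigma_{SS} = I$, which is an assumption made purely to keep the example short; the function name is hypothetical.

```python
import numpy as np


def mean_inverse_gram(n, s, n_samples, rng):
    """Monte Carlo estimate of E[(X_S^T X_S)^{-1}] with rows of X_S ~ N(0, I_s)."""
    X = rng.standard_normal((n_samples, n, s))
    gram = np.swapaxes(X, 1, 2) @ X          # batch of s x s Gram matrices
    return np.linalg.inv(gram).mean(axis=0)
```

With $n = 20$ and $s = 3$, the estimate should be close to $I / (n - s - 1) = I / 16$.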



Now define the random matrix $R = I_{n \times n} - X_S (X_S^T X_S)^{-1} X_S^T$. A straightforward calculation yields that $R^2 = R$, so that all the eigenvalues of $R$ are either 0 or 1. In particular, for any vector $z = X_S u$ in the range of $X_S$, we have

$$R z = \big[ I_{n \times n} - X_S (X_S^T X_S)^{-1} X_S^T \big] X_S u = 0. \qquad (42)$$

Hence $\dim(\ker R) = \dim(\mathrm{range}\, X_S) = s$. Since $R$ is symmetric and positive semidefinite, there exists an orthogonal matrix $U$ such that $R = U^T D U$, where $D$ is diagonal with $(n-s)$ ones and $s$ zeros. The random matrices $D$ and $U$ are both independent of $W$, since $X_S$ is independent of $W$. Hence we have

$$\frac{1}{n^2} \mathbb{E}[W^T R W \mid X_S] = \frac{1}{n^2} \mathbb{E}[W^T U^T D U W \mid X_S] = \frac{1}{n^2}\, \mathrm{trace}\big( D\, U\, \mathbb{E}[W W^T \mid X_S]\, U^T \big) = \sigma^2\, \frac{n-s}{n^2}, \qquad (43)$$

since $\mathbb{E}[W W^T] = \sigma^2 I$. Consequently, we have established that $\mathbb{E}[M_n] = \frac{\lambda_n^2}{n-s-1}\, b^T (\Sigma_{SS})^{-1} b + \frac{\sigma^2 (n-s)}{n^2}$, as claimed.
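The algebraic facts about $R$ used here (idempotence, annihilation of the range of $X_S$, and trace $n - s$) can be confirmed numerically with a few lines. This is a minimal sketch with a hypothetical function name.

```python
import numpy as np


def residual_projection(XS):
    """R = I - XS (XS^T XS)^{-1} XS^T, the projection onto range(XS)^perp."""
    n = XS.shape[0]
    return np.eye(n) - XS @ np.linalg.solve(XS.T @ XS, XS.T)
```

Since $\mathrm{trace}(R) = n - s$ and $\mathbb{E}[W W^T] = \sigma^2 I$, the identity $\mathbb{E}[W^T R W \mid X_S] = \sigma^2 (n - s)$ in (43) follows from the trace computation.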

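We now compute the expected value of the squared variance

$$M_n^2 = \underbrace{\lambda_n^4 \big[ b^T (X_S^T X_S)^{-1} b \big]^2}_{T_1} + \underbrace{\frac{2 \lambda_n^2}{n^2} \big[ b^T (X_S^T X_S)^{-1} b \big] (W^T R W)}_{T_2} + \underbrace{\frac{1}{n^4} (W^T R W)^2}_{T_3}.$$

First, conditioning on $X_S$ and using the eigenvalue decomposition $D$ of $R$, and writing $\tilde W = U W$, we have

$$\mathbb{E}[T_3 \mid X_S] = \frac{1}{n^4} \mathbb{E}\big[ (\tilde W^T D \tilde W)^2 \big] = \frac{1}{n^4} \mathbb{E}\Big[ \Big( \sum_{i=1}^{n-s} \tilde W_i^2 \Big)^2 \Big] = \frac{2 (n-s) \sigma^4}{n^4} + \frac{(n-s)^2 \sigma^4}{n^4}, \qquad (44)$$

whence $\mathbb{E}[T_3] = \frac{2 (n-s) \sigma^4}{n^4} + \frac{(n-s)^2 \sigma^4}{n^4}$ as well.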
Similarly, using conditional expectation and our previous calculation (43) of $\mathbb{E}[W^T R W \mid X_S]$, we have

$$\mathbb{E}[T_2] = \frac{2 \lambda_n^2}{n^2}\, \mathbb{E}\Big[ \mathbb{E}\big[ b^T (X_S^T X_S)^{-1} b\, (W^T R W) \mid X_S \big] \Big] = \frac{2 \lambda_n^2 (n-s) \sigma^2}{n^2}\, \mathbb{E}\big[ b^T (X_S^T X_S)^{-1} b \big] = \frac{2 \lambda_n^2 (n-s) \sigma^2}{n^2 (n-s-1)}\, b^T (\Sigma_{SS})^{-1} b, \qquad (45)$$

where the final step uses Lemma 7.7.1 of Anderson [1] on the expectation of inverse Wishart matrices.
Lastly, since $(X_S^T X_S)^{-1}$ is inverse Wishart with matrix $(\Sigma_{SS})^{-1}$, we can use formulae for second moments of inverse Wishart matrices (see, e.g., Siskind [31]) to write, for all $n > s + 3$,

$$\mathbb{E}[T_1] = \frac{\lambda_n^4 \big[ b^T (\Sigma_{SS})^{-1} b \big]^2}{(n-s)(n-s-3)} \Big[ 1 + \frac{1}{n-s-1} \Big].$$

Consequently, combining our results, we have

$$\mathrm{var}(M_n) = \mathbb{E}[M_n^2] - (\mathbb{E}[M_n])^2 = \sum_{i=1}^{3} \mathbb{E}[T_i] - \Big( \frac{\sigma^2 (n-s)}{n^2} + \frac{\lambda_n^2}{n-s-1}\, b^T (\Sigma_{SS})^{-1} b \Big)^2 = \underbrace{\frac{2 (n-s) \sigma^4}{n^4}}_{H_1} + \underbrace{\lambda_n^4 \big[ b^T (\Sigma_{SS})^{-1} b \big]^2 \Big\{ \frac{1}{(n-s)(n-s-3)} \Big[ 1 + \frac{1}{n-s-1} \Big] - \frac{1}{(n-s-1)^2} \Big\}}_{H_2}. \qquad (46)$$



Finally, we establish the concentration result. Using Chebyshev's inequality, we have

$$\mathbb{P}\big[ |M_n - \mathbb{E}[M_n]| \ge \delta\, \mathbb{E}[M_n] \big] \;\le\; \frac{\mathrm{var}(M_n)}{\delta^2 (\mathbb{E}[M_n])^2},$$

so that it suffices to prove that $\mathrm{var}(M_n) / (\mathbb{E}[M_n])^2 \to 0$ as $n \to +\infty$. We deal with each of the two variance terms $H_1$ and $H_2$ in equation (46) separately. First, we have

$$\frac{H_1}{(\mathbb{E}[M_n])^2} \;\le\; \frac{2 (n-s) \sigma^4}{n^4} \cdot \frac{n^4}{(n-s)^2 \sigma^4} = \frac{2}{n-s} \;\to\; 0.$$

Secondly, denoting $A = b^T (\Sigma_{SS})^{-1} b$ for short-hand, we have

$$\frac{H_2}{(\mathbb{E}[M_n])^2} \;\le\; \frac{(n-s-1)^2}{\lambda_n^4 A^2} \cdot \lambda_n^4 A^2 \Big\{ \frac{1}{(n-s)(n-s-3)} \Big[ 1 + \frac{1}{n-s-1} \Big] - \frac{1}{(n-s-1)^2} \Big\} = \frac{(n-s-1)^2}{(n-s)(n-s-3)} \Big[ 1 + \frac{1}{n-s-1} \Big] - 1,$$

which also converges to 0 as $(n - s) \to +\infty$.
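D.3 Proof of Lemma 4

Recall that the Gaussian random vector $(Z_1, \ldots, Z_N)$ is zero-mean with covariance $v^* \Sigma_{(S^c \mid S)}$, where $\Sigma_{(S^c \mid S)} := \Sigma_{S^c S^c} - \Sigma_{S^c S} (\Sigma_{SS})^{-1} \Sigma_{S S^c}$. For any index $i$, let $e_i \in \mathbb{R}^N$ be equal to 1 in position $i$, and zero otherwise. For any two indices $i \neq j$, we have

$$\mathbb{E}[(Z_i - Z_j)^2] = v^* (e_i - e_j)^T \Sigma_{(S^c \mid S)} (e_i - e_j) \;\le\; 2 v^* \lambda_{\max}(\Sigma_{(S^c \mid S)}) \;\le\; 2 C_{\max} v^*,$$

since $\Sigma_{(S^c \mid S)} \preceq \Sigma_{S^c S^c}$ by definition, and $\lambda_{\max}(\Sigma_{S^c S^c}) \le \lambda_{\max}(\Sigma) \le C_{\max}$.

Letting $(X_1, \ldots, X_N) \sim N(0, C_{\max} v^* I_{N \times N})$, we have $\mathbb{E}[(X_i - X_j)^2] = 2 C_{\max} v^*$. Hence, applying the Sudakov-Fernique inequality [25] yields $\mathbb{E}[\max_j Z_j] \le \mathbb{E}[\max_j X_j]$. By asymptotic behavior of i.i.d. Gaussians, we have $\lim_{N \to \infty} \frac{\mathbb{E}[\max_j X_j]}{\sqrt{2 C_{\max} v^* \log N}} = 1$. Consequently, for all $\delta' > 0$, there exists an $N(\delta')$ such that for all $N \ge N(\delta')$, we have

$$\frac{1}{\lambda_n} \mathbb{E}[\max_j Z_j(v^*)] \;\le\; \frac{1}{\lambda_n} \mathbb{E}[\max_j X_j] \;\le\; (1 + \delta') \sqrt{\frac{2 C_{\max} v^* \log N}{\lambda_n^2}}$$

$$= (1 + \delta') \sqrt{1 + \delta}\, \sqrt{ \frac{2 C_{\max} \log N}{n-s-1}\, b^T (\Sigma_{SS})^{-1} b + \frac{2 C_{\max} \sigma^2 (n-s) \log N}{n^2 \lambda_n^2} }$$

$$\le (1 + \delta') \sqrt{1 + \delta}\, \sqrt{ \frac{2 C_{\max}\, s \log N}{C_{\min} (n-s-1)} + \frac{2 C_{\max} \sigma^2 \log N}{n \lambda_n^2} }.$$
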

Now, applying our condition bounding n,N via f and 6 U , we have 



z/logiV A , 2C max a 2 logN 



-^E[maxZ>*)] < (1 + 6') Vl + 5 J e 2 ( 1 - + 

An J V V n - s - 1 / 



X n j V\n — s — IJ nX 2 

Recall that by assumption, as n, N — > +oo, we have that and converge to zero. Con- 

sequently, the RHS converges to (1 + 5') + 5)e as n, A^ — > oo. Hence, we have 

lim -3-E[maxZ,(u*)] < (1 + 5') v^TTI e. 

n-»+oc A„ j 

Since 6' > and (5 > were arbitrary, the result follows. 

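D.4 Proof of Lemma 5

Consider the function $f : \mathbb{R}^N \to \mathbb{R}$ given by

$$f(w) := \max_{1 \le j \le N} \big[ \sqrt{v^* \Sigma_{(S^c \mid S)}}\; w \big]_j,$$

where $\Sigma_{(S^c \mid S)} := \Sigma_{S^c S^c} - \Sigma_{S^c S} (\Sigma_{SS})^{-1} \Sigma_{S S^c}$. By construction, for a Gaussian random vector $V \sim N(0, I)$, we have $f(V) = \max_{j \in S^c} Z_j$.

We now bound the Lipschitz constant of $f$. Let $R = \sqrt{\Sigma_{(S^c \mid S)}}$. For each $w, v \in \mathbb{R}^N$ and index $j = 1, \ldots, N$, we have

$$\big| [\sqrt{v^*} R w]_j - [\sqrt{v^*} R v]_j \big| \;\le\; \sqrt{v^*}\, \Big| \sum_k R_{jk} [w_k - v_k] \Big| \;\le\; \sqrt{v^*}\, \| w - v \|_2,$$

where the last inequality follows from the Cauchy-Schwarz inequality, since $\sum_k R_{jk}^2 = [\Sigma_{(S^c \mid S)}]_{jj} \le 1$. Therefore, by Gaussian concentration of measure for Lipschitz functions [24, 27], we conclude that for any $\eta > 0$, it holds that

$$\mathbb{P}\big[ f(W) \ge \mathbb{E}[f(W)] + \eta \big] \;\le\; \exp\Big( -\frac{\eta^2}{2 v^*} \Big), \quad \text{and} \quad \mathbb{P}\big[ f(W) \le \mathbb{E}[f(W)] - \eta \big] \;\le\; \exp\Big( -\frac{\eta^2}{2 v^*} \Big).$$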

D.5 Proof of Lemma 6

Since the matrix $X_S^T X_S$ is Wishart with $n$ degrees of freedom, using properties of the inverse Wishart distribution, we have $\mathbb{E}[(X_S^T X_S)^{-1}] = \frac{(\Sigma_{SS})^{-1}}{n-s-1}$ (see Lemma 7.7.1 of Anderson [1]). Thus, we compute



E[Y{] 



A " n -ef (S55) 1 6, and 



n — s — 1 



it n 



J ej. 



n n — s — 1 1 " n — s — 1 

Moreover, using formulae for second moments of inverse Wishart matrices (see, e.g., Siskind [31]), we compute for all $n > s + 3$



E[K 2 ] 

E[(y/) 2 ] 



A 2 n 2 



(n — s) (n — s — 3) 
cr 4 re 2 



''/'l^)- J '') a + ^(5 :r W^ (ef^s)" 1 *) 



n — s 



1 + 



1 



re — s — 1 



(re — s — l) 2 (n — s) (re — s — 3) 

We now compute and bound the variance of Yi. Setting A = ej '(Egg) -1 ? and i?j = e^Egs) -1 ? 
for shorthand, we have 



var(Yj) 



A 2 n 2 



(re — s) (re — s — 3) 

A 2 n 2 
(re — s) (re — s — 3) 



A 2 + — TABi 

re — s — 1 

Ad- (n T s)(n ~*: 3 M + - J— as, 



\2 „2 

(n-s-1) 2 * 



(re — s — 1) 



n — s — 1 



2A r 2 



3 A 2 A-B, 
+ 



re — s n — s — 1 
for $n$ sufficiently large. Using the bound $\| (\Sigma_{SS})^{-1} \|_\infty \le D_{\max}$, we see that the quantities $A_i$ and $B_i$ are uniformly bounded for all $i$. Hence, we conclude that, for $n$ sufficiently large, the variance is bounded as

$$\mathrm{var}(Y_i) \;\le\; \frac{K \lambda_n^2}{n-s} \qquad (49)$$

for some fixed constant $K$ independent of $s$ and $n$.

Now since $|\mathbb{E}[Y_i]| \le \frac{2 D_{\max} \lambda_n n}{n-s-1}$, we have

$$|Y_i - \mathbb{E}[Y_i]| \;\ge\; |Y_i| - |\mathbb{E}[Y_i]| \;\ge\; |Y_i| - \frac{2 D_{\max} \lambda_n n}{n-s-1}.$$

Consequently, making use of Chebyshev's inequality, we have

$$\mathbb{P}\Big[ |Y_i| \ge \frac{6 D_{\max} \lambda_n n}{n-s-1} \Big] \;\le\; \mathbb{P}\Big[ |Y_i - \mathbb{E}[Y_i]| \ge \frac{4 D_{\max} \lambda_n n}{n-s-1} \Big] \;\le\; \frac{\mathrm{var}(Y_i)}{16 D_{\max}^2 \lambda_n^2} \;\le\; \frac{K}{16 D_{\max}^2 (n-s)},$$

where the final step uses the bound (49).

We now compute and bound the variance of $Y_i'$. We have



var(F/) 



a 4 n 2 



< 



(n — s — l) 2 (n — s) (n — s — 3) 
er 4 n 2 

(re — s — l) 2 (n — s) (n — s — 3) 
Ka 4 



A 



Ai 



1 + 
1 + 



f 



n — s — 1 
1 

n — s — 1 



(7 



-4 2 

(n-s-1) 2 1 

(n — s) (n — s — 3) 
re 2 



(re — s — l) 2 



for some constant $K$ independent of $s$ and $n$. Consequently, applying Chebyshev's inequality, we have



P[F/ > 2E[y/]] = P[l7 - E[Y/] > E[y/]] < 



< 



< 



< 



var(y/) 

(E[y/]) 2 



(™-*-i) 3 ^(s 5S )-i e , 



Kn 2 C„ 



cr 4 (n - s - l) 3 

K' 
n — s — 1 



for some constant $K'$ independent of $s$ and $n$.
D.6 Proof of Lemma 7

As in the proof of Lemma 4, we define and bound

$$\Delta_Z(i, j) := \mathbb{E}[(Z_i - Z_j)^2] \;\le\; 2 C_{\max} v^*.$$

Now let $(X_1, \ldots, X_N)$ be an i.i.d. zero-mean Gaussian vector with $\mathrm{var}(X_i) = C_{\max} v^*$, so that $\Delta_X(i, j) := \mathbb{E}[(X_i - X_j)^2] = 2 C_{\max} v^*$. If we set

$$\Delta^* := \max_{i, j \in S^c} |\Delta_X(i, j) - \Delta_Z(i, j)|,$$

then, by applying a known error bound for the Sudakov-Fernique inequality, we are guaranteed that

$$\mathbb{E}[\max_{j \in S^c} Z_j] \;\ge\; \mathbb{E}[\max_{j \in S^c} X_j] - \sqrt{\Delta^* \log N}. \qquad (50)$$

We now show that the quantity $\Delta^*$ is upper bounded by

$$\Delta^* \;\le\; 2 v^* \Big( C_{\max} - \frac{1}{C_{\max}} \Big).$$

Using the inversion formula for block-partitioned matrices [22], we have

$$\Sigma_{(S^c \mid S)} := \Sigma_{S^c S^c} - \Sigma_{S^c S} (\Sigma_{SS})^{-1} \Sigma_{S S^c} = \big[ (\Sigma^{-1})_{S^c S^c} \big]^{-1}.$$

Consequently, we have the lower bound

EKZi-Zj) 2 } = v*(ei - e j ) T T l{S c\ s) (e i - ej) 

> 2v* A m , n (S(5c|5)) 

> 2v*K min {^- 1 ) 

2v* 



S C S C 



In turn, this leads to the upper bound

$$\Delta^* = \max_{i, j \in S^c} |\Delta_X(i, j) - \Delta_Z(i, j)| = \max_{i, j \in S^c} \big[ 2 v^* C_{\max} - \Delta_Z(i, j) \big] \;\le\; 2 v^* \Big( C_{\max} - \frac{1}{C_{\max}} \Big).$$


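We now analyze the behavior of $\mathbb{E}[\max_{j \in S^c} X_j]$. Using asymptotic results on the extrema of i.i.d. Gaussian sequences, we have $\lim_{N \to +\infty} \frac{\mathbb{E}[\max_{j \in S^c} X_j]}{\sqrt{2 C_{\max} v^* \log N}} = 1$. Consequently, for all $\delta' > 0$, there exists an $N(\delta')$ such that for all $N \ge N(\delta')$, we have

$$\mathbb{E}[\max_j X_j] \;\ge\; (1 - \delta') \sqrt{2 C_{\max} v^* \log N}.$$

Applying this lower bound to the bound (50), we have

$$\mathbb{E}[\max_{j \in S^c} Z_j] \;\ge\; (1 - \delta') \sqrt{2 C_{\max} v^* \log N} - \sqrt{\Delta^* \log N}$$

$$\ge\; (1 - \delta') \sqrt{2 C_{\max} v^* \log N} - \sqrt{2 v^* \Big( C_{\max} - \frac{1}{C_{\max}} \Big) \log N}$$

$$=\; \Big[ (1 - \delta') \sqrt{C_{\max}} - \sqrt{C_{\max} - \frac{1}{C_{\max}}} \Big] \sqrt{2 v^* \log N}. \qquad (51)$$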

First, assume that $\lambda_n^2 / v^*$ does not diverge to infinity. Then, there exists some $\alpha > 0$ such that $\lambda_n^2 / v^* \le \alpha$ for all sufficiently large $n$. In this case, we have from the bound (51) that

$$\frac{1}{\lambda_n} \mathbb{E}[\max_{j \in S^c} Z_j] \;\ge\; \gamma \sqrt{\log N},$$

where $\gamma := \Big[ (1 - \delta') \sqrt{C_{\max}} - \sqrt{C_{\max} - \frac{1}{C_{\max}}} \Big] \sqrt{\frac{2}{\alpha}} > 0$. (Note that by choosing $\delta' > 0$ sufficiently small, we can always guarantee that $\gamma > 0$, since $C_{\max} \ge 1$.) This completes the proof of condition (b) in the lemma statement.

Otherwise, we may assume that $\lambda_n^2 / v^* \to +\infty$. We compute



1 



y/2v* log N = Vl-6 



2 log N t t 2^(1-5) N 
-b 1 (S55) i o H 



n — s — 1 



n 



2\ogN 
n — s — 1 



6 T (X 5S )-i& 



> 



1-5 / 2s log N 



n — s — 1 



We now apply the condition 



2s log N 
n — s — 1 

to obtain that 
1 



> 



1 



Oi-v 



Cmax (2 f.) / 



yCn 



Cn 



vC max {2 — eY 



— E[max^] > ^(1-5) 

A 77, J 6*3 



1 



max (7 

^max 



(2 - e) (52) 



\fC~r, 



C n 



Cmax 



fCmax (2 c)' 1 



Recall that $\nu\, C_{\max} (2 - \epsilon)^2 > 0$ is fixed, and moreover that $\delta, \delta' > 0$ are arbitrary. Let $F(\delta, \delta')$ be the lower bound on the RHS of (52). Note that $F$ is a continuous function, and moreover that



F(0,0) 



V C max — ^ 



Cmax 



\j V C m ax ~ \J C n 



(2-e) > (2-6) 



Cmax 



vCmax (2 — e) z 



Therefore, by the continuity of $F$, we can choose $\delta, \delta' > 0$ sufficiently small to ensure that for some $\gamma > 0$, we have $\frac{1}{\lambda_n} \mathbb{E}[\max_{j \in S^c} Z_j] \ge (2 - \epsilon)(1 + \gamma)$ for all sufficiently large $n$.
D.7 Proof of Lemma 8

This claim follows from the proof of Lemma 5.